server freezes during backup with Cheyenne ARCserve

2007-12-25 11:56:00

After replacing the system board, all the controllers, and the cables,

and after talking multiple times to tech support from four companies,

and after having the darn thing crash every day for two months, it turns

out that what was needed was to add a patch that wasn't on the

recommended list. *Sigh* I had gone through the patch list several times

and must have overlooked it every time.

I appreciate the responses I received from

Eugene Kramer <eugene@uniteq.com>

Glenn Satchell <Glenn.Satchell@uniq.com.au>

Cat Okita <cat@uunet.ca>

Thomas FRANK <thomas.frank@magnet.at>

Bismark Espinoza <bismark@alta.Jpl.Nasa.Gov>

My original question:

> SPARC 10, 128 MB RAM, Solaris 2.5 w/ recommended patches

>

> Soon after installing Cheyenne ARCserve software to run our Qualstar

> 4210A tape changer, our main NFS server started freezing at random

> places during a full backup or sometimes the next day. During the

> backup, there are no other users on the network and no unusual processes

> running. Strangely, it doesn't crash or show any error messages; it

> just freezes completely. I cannot ping the machine and STOP-A does

> nothing. The only way to recover is cycle the power manually.

>

> Naturally, Sun said it was probably an application problem while

> Cheyenne said it is probably an OS problem. I have syslog logging

> everything from info on up to one huge log, but there are no error

> messages of any kind reported. I have vmstat logging to a typescript

> every five seconds during the backup, and the memory goes down to 1 or

> two MB many times during the backup, but there is always over 200 MB

> of swap available when it freezes.

>

> I have tested the hardware with some graphical package from Sun (I forget

> what it's called) and everything checked out ok.

>

> Does anyone have any further troubleshooting techniques or ideas why the

> machine completely freezes at random times during or after a backup? I

> am pretty certain it has to do with the software since the trouble

> started occurring soon after install and only appears after a backup, but

> how can I confirm this? Does anyone have similar problems with Cheyenne

> ARCserve or can recommend a different software package for running a

> tape changer? Are there any other ways I can probe the OS to isolate

> what is causing the problem?

>

> Thank you.

>

Responses:

-----------------------

I had a problem like that on a Sparc 10/128M/Solaris 2.5 WITHOUT

Cheyenne.

Turned out that we had a disk drive with crappy SCSI. Our Sparc was a

file server and it usually would hang during release time (our software

occupies about 500M) or backup (Networker with 3 simultaneous backup

streams).

Taking the disk out of the picture got rid of the problems.

When the system crashed I almost always had a selection light on the faulty

disk.

BTW: disk: Micropolis 9G (old 5 inch format). I've just gotten a new one

from Micropolis, but did not install it yet.

-----------------------

There is a (reasonably well known) bug in the ethernet hardware on the

SS10s. Apparently there isn't enough buffer space devoted to the lance

chip. It was revised in the SS20 which doesn't have this problem. The

symptoms are that the ethernet locks up under heavy load. I've seen

this happen on database servers. Sometimes there's an error message in

/var/adm/messages, other times not.

It's a hardware design issue, so there's no software workaround. What

most folks do is to install an SBus ethernet card and use that

interface. The 10 Mb/s ethernet/scsi cards don't have this problem and

work just fine. Of course, you could always buy one of the 100 Mb/s fast

ethernet cards too.

-----------------------

This almost sounds like the ethernet problem again - Sun added in a

patch (*not* on the recommended/security list, btw) for their ethernet

interfaces to fix this...

-----------------------

I had possibly the same problem with Solaris 2.5.1 and Legato Networker on

a SS10.

During the backup, or at other times, the server froze. The backup system

had only been installed a few weeks earlier, so we thought the problem was

in the backup hardware and/or software.

But we found the fault by swapping the system components: power supply,

motherboard, and at last the CPU. And it was the CPU !!!

So, if you have a chance to test the backup system on an identical machine,

then do it, or swap the system components of the SS10 like I did.

-----------------------

Look at CPU load with "vmstat 5", I/O load with "iostat 5",

and NFS load with repeated "nfsstat -s".

-----------------------
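
The monitoring suggested in the last response (and the vmstat typescript

mentioned in the original question) could be scripted roughly as below.

This is only a sketch, not anything a respondent posted: it assumes a

Python 3.7+ interpreter on the host, and the function names (stamp_lines,

append_durably, sample, run_logger) are made up for illustration. Each

sample is timestamped and forced to disk with fsync, so the last readings

before a hard freeze have a better chance of surviving the power cycle.

```python
import os
import subprocess
import time


def stamp_lines(text, when):
    """Prefix every non-empty line of text with the timestamp string."""
    return [f"{when} {line}" for line in text.splitlines() if line.strip()]


def append_durably(path, lines):
    """Append lines to the log and force them to disk before returning."""
    with open(path, "a") as log:
        for line in lines:
            log.write(line + "\n")
        log.flush()
        os.fsync(log.fileno())


def sample(command):
    """Run one command and return its output, or a note if it fails."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=30
        )
        return result.stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"{' '.join(command)} failed: {exc}"


def run_logger(commands, path, interval=5, samples=None):
    """Sample every command each interval; samples=None runs until killed."""
    taken = 0
    while samples is None or taken < samples:
        when = time.strftime("%Y-%m-%d %H:%M:%S")
        for command in commands:
            append_durably(path, stamp_lines(sample(command), when))
        taken += 1
        time.sleep(interval)
```

A call such as run_logger([["vmstat"], ["iostat"], ["nfsstat", "-s"]],

"/var/tmp/freeze.log") would take one reading of each command every five

seconds; the log path is an arbitrary choice. Note that vmstat and iostat

run without an interval report averages since boot, so the arguments may

need adjusting to get per-interval figures.
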
