NFS problem or something else?

2007-12-25 11:50:00

The short answer is that it was a network problem. Here is the original
question and then some nitty-gritty details.

Help,

  I have a Sparc 1000, Solaris 2.5.1, recommended patches are fairly
current, using NIS.

  About 8:20 this morning users connected to the server through Sun
workstations or PCs with Exceed 5 lost their connections. Whatever they
were doing froze. After some initial investigation I rebooted the server,
and that did not fix the problem. I can ping the workstations and they can
ping the server. Telnet and rlogin work from a PC/workstation to the
server. When I fire off Exceed from a PC I never get the CDE login
screen. If I reboot a workstation I can log in as root and see my NFS
mounted file systems. I can cd to an NFS directory, but if I do an "ls" I
get one message, "NFS server earth not responding still trying". No big
deal, I've seen that before. I've done an nfs.server stop and start. That
doesn't help. All my NFS daemons are running: mountd, lockd, nfsd, statd.
Have I missed any? There are no errors in the server's messages file.

  On a workstation (Ultra 140, Solaris 2.5.1) I have found this in the
messages file.

inetd[119]: yp_all - RPC clnt_call (transport level) failure: RPC: timed out.

Pings work both ways with name and address. The system crashed on
Friday because of a failure in my UPS, but we seem to have that resolved
and the system was running on the UPS all weekend without any problems.

Any ideas on what my problem is?

I spent the bulk of the day on the phone with my local tech support, and
then with Sun's tech support, before we found a solution. I never got a
chance to read any replies till this morning, but Glenn Satchell hit the
nail on the head.

After we switched the ethernet cable to another wall port we were able to
re-establish communications with the workstations/PCs. The network is in
another department's hands and the switches are MAG ATM, 740 for the
backbone and 280's for workgroups. When the switches were reset last
night one of them didn't come back and had to be replaced.

  The conclusion that was arrived at yesterday was that there was a hub,
port, or cable problem. Traffic that didn't put a load on the network,
like ping, worked just fine. When we tried to do something a little more
intensive, like use NIS or execute a command on an NFS file system, the
network couldn't handle it.
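
The difference between light and heavy traffic can be checked directly.
The Python sketch below is one way to do that, not what was run at the
time: it sends UDP datagrams of increasing size and reports the fraction
lost at each size. The function names are made up for illustration, and
it assumes a UDP echo service is listening on the target host.

```python
import socket
import time


def udp_echo_roundtrip(host, port, size, timeout=2.0):
    """Send one UDP datagram of `size` bytes and wait for the echo.

    Returns the round-trip time in seconds, or None if the reply is
    missing, wrong, or the send fails. Assumes host:port echoes
    datagrams back unchanged (an assumption, not a given).
    """
    payload = bytes(size)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(payload, (host, port))
            data, _ = sock.recvfrom(65535)
        except OSError:  # includes socket timeouts
            return None
        if data != payload:
            return None
        return time.monotonic() - start


def loss_by_size(roundtrip, sizes, tries=5):
    """Return {size: fraction of probes lost} for each payload size.

    `roundtrip` is any callable taking a size and returning a time,
    or None when the probe was lost.
    """
    report = {}
    for size in sizes:
        lost = sum(1 for _ in range(tries) if roundtrip(size) is None)
        report[size] = lost / tries
    return report
```

For example, `loss_by_size(lambda n: udp_echo_roundtrip("earth", 7, n),
[56, 1400, 8192])` would compare a ping-sized payload with a near-MTU
datagram and an 8 KB datagram that has to be fragmented on Ethernet.
Port 7 (the classic echo service) is an assumption here; many sites
disable it. Small sizes with no loss and large sizes with heavy loss
would point at the path rather than at NFS or NIS.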

  Along the long rocky road to an answer, some of what we did was:
nfs.server and nfs.client stop and start, verify that the NFS daemons
were running, stop and start yp and verify those daemons were running,
rpcinfo, dfmounts, and ypwhich. We connected to a workstation with
Exceed, telnetted workstation to workstation, and switched to another
ethernet card on the server. Everything we looked at on the server looked
good, and most of what we looked at or tried on the clients worked. And
there were no complaints in my server's messages file. I've learned a
fair amount, including not to assume the network is healthy because ping
and telnet work.
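
Checks like these tend to hang rather than fail when the network is the
problem, so it helps to run them under a timeout. A minimal sketch in
Python, assuming a machine on the same network with Python 3.7 or later
(not the Solaris 2.5.1 hosts themselves); `run_checks` is a hypothetical
helper, not an existing tool.

```python
import subprocess


def run_checks(commands, timeout=10):
    """Run each command (a list of arguments) and classify the outcome.

    Returns {command string: "ok" | "exit N" | "timed out" | "not found"}.
    A command that hangs is killed once `timeout` seconds have passed.
    """
    results = {}
    for cmd in commands:
        label = " ".join(cmd)
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            results[label] = "timed out"
        except FileNotFoundError:
            results[label] = "not found"
        else:
            if proc.returncode == 0:
                results[label] = "ok"
            else:
                results[label] = f"exit {proc.returncode}"
    return results
```

A call such as `run_checks([["rpcinfo", "-p", "earth"], ["ypwhich"]])`
would show at a glance which checks answer and which stall. Several
"timed out" results alongside working pings fit the pattern described
here.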

I'm done rambling; here are the answers I received.

From Glenn Satchell

Sounds like a router, hub or switch may be having a hard time. The yp_all
message is usually an indication that the network failed. Try resetting
your hub(s) or switching ports, etc.
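
Since the yp_all timeout is the telltale sign, it can be useful to count
such messages per program in a client's messages file. The following is
a sketch only, in Python: the regular expression is an assumption based
on the single log line quoted in the question, and `rpc_timeouts` is an
illustrative name.

```python
import re
from collections import Counter

# Matches "prog[pid]: ... RPC: timed out ..." anywhere in a line, so an
# optional syslog timestamp/host prefix is tolerated.
RPC_TIMEOUT = re.compile(
    r"(?P<prog>[\w.-]+)\[(?P<pid>\d+)\]:\s+(?P<msg>.*RPC: timed out.*)"
)


def rpc_timeouts(lines):
    """Count RPC timeout messages per program name."""
    counts = Counter()
    for line in lines:
        match = RPC_TIMEOUT.search(line)
        if match:
            counts[match.group("prog")] += 1
    return counts
```

It could be fed an open file, for instance
`rpc_timeouts(open("/var/adm/messages"))`, assuming that is where the
client logs to. A burst of timeouts from several programs at the same
moment suggests the network rather than any one daemon.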

From Joel Lee

You need to bring the NFS server earth up. If it is up, you need to start
nfsd there. I suppose you are not using automount, right? If that's the
case, it's natural that your users who use Exceed would probably hang as
well.

I may not have been clear in my original post; I was in a hurry. The NFS
server was always up and the NFS daemons were running. Ralph

From Artur Shnayder

Try to increase the file descriptor limit on the NIS server. You can just
add the following line to /etc/system:

set rlim_fd_cur=512
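
Before changing the setting it is worth checking what limit a process
actually sees. A hedged sketch in Python using the standard `resource`
module (Unix only); `fd_limits` and `below` are illustrative names, and
512 is simply the value suggested above.

```python
import resource


def fd_limits():
    """Return the (soft, hard) open-file limits for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def below(limit, wanted):
    """True if `limit` is finite and smaller than `wanted`."""
    return limit != resource.RLIM_INFINITY and limit < wanted


if __name__ == "__main__":
    soft, hard = fd_limits()
    print("soft limit:", soft, "hard limit:", hard)
    if below(soft, 512):
        print("soft limit is under 512; raising it may be worthwhile")
```

Note that rlim_fd_cur sets the default soft limit, and changes to
/etc/system take effect only after a reboot.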
