An anonymous reader writes: I'm having some pain and would like to vent and maybe get some advice.
I joined this startup like 8 months ago. At first we had a cheap PC running Ubuntu as our www+ldap+nfs server. There was a small CentOS cluster (TORQUE PBS) running h/w EDA stuff, and a few Ubuntu machines for the s/w folks, all running off that server. For e-mail we decided to use Yahoo small business. All was well.
A few months later it was time to have our own e-mail server, so we decided to use Exchange Server (pretty standard for non-dotcom companies around here). So we got an IT guy. He convinced us to go with Windows Storage Server 2003 and migrate to Active Directory so that we could have user-based access control over NFS and centralized authentication for both Unix and Windows machines. I had my concerns about running NFS off Windows (symbolic links), but was told that it would work. I disagreed, but went along against my instinct.
First problem that came up before the migration: setgid/setuid doesn't work. MS confirmed that they broke it in the new SFU. This kind of screwed up our original access control plans. We went ahead without this feature.
Second problem that came up: filenames. Windows does not allow certain characters in a filename (and Unix programs love to use some of these characters). Luckily there was a mapping function in SFU, so we mapped the forbidden characters to code points above 256 (NTFS supports Unicode).
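For anyone wondering what that kind of mapping amounts to, here is a minimal sketch in Python. It is not the SFU mapping file format, and the offset is an assumption picked purely for illustration (the real mapping is whatever you configure); it only shows the idea of shifting each forbidden character to a code point above 256 and back again.

```python
# Characters that Windows rejects in file names ("/" is left out because
# it is a path separator on Unix as well).
FORBIDDEN = '<>:"\\|?*'

# Assumption: shift forbidden characters into the Unicode Private Use Area.
# This offset is hypothetical; the real SFU mapping is user-configured.
OFFSET = 0xF000


def to_ntfs(name):
    """Replace each forbidden character with its shifted code point."""
    return "".join(chr(ord(c) + OFFSET) if c in FORBIDDEN else c for c in name)


def from_ntfs(name):
    """Undo to_ntfs: shift mapped characters back, leave the rest alone."""
    return "".join(
        chr(ord(c) - OFFSET)
        if ord(c) >= OFFSET and chr(ord(c) - OFFSET) in FORBIDDEN
        else c
        for c in name
    )
```

The point of mapping into an otherwise unused range is that the round trip is lossless, so the Unix side keeps seeing the original names.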
After some short tests we scheduled a night shutdown and went ahead with the migration. My TORQUE PBS nodes went down, but I figured I could fix them later. After all, these are just applications! So now we're up and running, kind of.
Users started to report slow logins. It now takes 2-5 seconds to authenticate a user (name resolution is configured properly). So I started running nscd on the CentOS 4.4 machines. This resolved the problem... except that nscd would crash about every two days on average. After some debugging I got this message:
../../../libraries/liblber/sockbuf.c:91: ber_sockbuf_ctrl: Assertion `( (sb)->sb_opts.lbo_valid == 0x3 )' failed
Interestingly, after I got the TORQUE PBS nodes up and running again, pbs_mom started crashing with similar frequency, with the exact same message!
While the IT guy is frantically installing any new patch releases from MS, I have three choices on the Linux side:
1. create a watchdog script to restart the pbs_mom and nscd daemons
2. upgrade the ldap-related packages, hoping that the problem will go away
3. fire up gdb and start looking at some source code
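For what it's worth, option 1 could be sketched roughly like this in Python. Treat it as a sketch under assumptions, not something I have running: it assumes a Python 3 interpreter on the nodes, that `pgrep` is installed, and that each daemon has an init script of the same name so that `service <name> restart` works. The `DAEMONS` list and the `run` parameter are names I made up for illustration.

```python
import subprocess

# Assumption: init script names match the process names.
DAEMONS = ["nscd", "pbs_mom"]


def is_running(name, run=subprocess.run):
    """Return True if a process with exactly this name exists (pgrep -x)."""
    result = run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def check_once(daemons, run=subprocess.run):
    """Restart every daemon that is not running; return the restarted names."""
    restarted = []
    for name in daemons:
        if not is_running(name, run):
            # Assumption: "service <name> restart" is the right restart command.
            run(["service", name, "restart"])
            restarted.append(name)
    return restarted
```

Calling `check_once(DAEMONS)` from a cron job every minute or so would paper over the crashes. It does nothing about the underlying assertion failure, which is why it is only a stopgap.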
So I downloaded the latest source of nss_ldap and compiled it. I replaced the CentOS default .so and watched: it didn't fix the problem.
Next I downloaded the latest stable OpenLDAP source. It doesn't even compile when I try it on the Windows NFS mount. After moving it to /tmp all was well. I'm also building pam_ldap.so from the latest source. Will report back after a few days.
That's it for me as far as venting is concerned. If anyone ever considers running a Linux cluster with Active Directory... please do me a favor and save the money. MS simply does not work in a Unix environment no matter what they say. It looks like they did not spend any QA money on it.
If you guys have any tips or suggestions please post.
thanks
p.s. a few other "small problems" with MS's NFS implementation:
1. the CentOS "find" binary uses an optimization that does not work with NFS over NTFS. Now we have to use find -noleaf... argGGh!
2. once in a while we'd get these weird files in a directory that we can't delete (probably related to locks). The only way to destroy them was to nuke the entire directory with "rm -rf" from the parent. Rebooting the Windows server also works.....
3. sometimes (very rarely) files would disappear when you do ls in Linux (but if you access them by name directly they're still there).
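On item 1: the optimization that -noleaf turns off is find's assumption that a directory's hard link count is 2 plus its number of subdirectories. If you want to check whether a mount breaks that convention, a rough sketch in Python could look like the following. The function names are mine, and I have not run this against the Windows NFS export, so take it as an illustration only.

```python
import os


def count_subdirs(path):
    """Count immediate subdirectories without following symlinks."""
    with os.scandir(path) as entries:
        return sum(1 for e in entries if e.is_dir(follow_symlinks=False))


def leaf_assumption_holds(nlink, subdirs):
    """Classic Unix convention: link count = 2 ('.' and the parent's entry)
    plus one '..' link per subdirectory."""
    return nlink == subdirs + 2


def check_tree(root):
    """Return the directories under root where the convention fails."""
    bad = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        nlink = os.lstat(dirpath).st_nlink
        if not leaf_assumption_holds(nlink, count_subdirs(dirpath)):
            bad.append(dirpath)
    return bad
```

If `check_tree` returns anything for a given mount point, find without -noleaf can skip subdirectories there. Some local filesystems also break the convention, so a non-empty result is not specific to NFS over NTFS.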