| | | Forum Newbie
       
Group: Forum Members Last Login: 6/8/2009 5:05:27 PM Posts: 8, Visits: 200 |
| Hello again all,
I've got some concerns with WUG 12.3 - and I seem to be getting nowhere with telephone support. I spent almost 4 hours on the phone with them yesterday and got no resolution or answers to the problems I seem to be having.
First off - a little about the WUG server:
Windows 2k3 SP2
Quad Xeon 1.89 GHz
4 Gigs of RAM
Monitoring 430 devices via SNMP and WMI.
The problems that I am seeing:
False Positives
Almost nightly, there are numerous false positives hitting my inbox, and they seem to hit all at once. For example, I am monitoring remote locations with a simple ping and a WMI check to see how large a file is on a target PC - the ping is fine, but the WMI check seems to fail at the same time for multiple remote locations. This is quite problematic as we need to know if this specific file hits a threshold or not - and we need it to be accurate.
What I've done to try and mitigate this is take recommendations from users on these forums who seem to be having the same problems - staggered polling intervals, increased the timeout and retries, and verified the RPCPingTimeout registry settings.
Another interesting item to this - I set these devices to a maintenance schedule over the night - ending at 7am Pacific Time - once that maintenance schedule was over - WUG started sending out notifications to me about sites down - but there were no actions applied to the devices.
Devices going down, but showing as down for 20 minutes

This is an odd one - a device will go down, but then send a notification that it's been down for 20 minutes when, in actual fact, it may not even have reached the threshold at which the notification should go out. I can't figure out why this is happening, nor can the telephone support staff. Please check out the picture above to get an idea of what I am referring to. The check shows as red, down for 20 minutes, but the duration is 1m.
Action Policies removed from devices, still send notifications
A great example of this is yesterday. I removed Action Policies from the devices I was getting false positives on, yet 4+ hours later I was still getting notifications in my inbox that there was a problem. My theory is that WUG is queuing up actions and not processing them quickly enough. Related to this theory, we also see notifications come in very delayed. WUG telephone support could offer me no way to check what is spooled/queued in WUG as far as actions, polls, etc., and I can't figure out a way to determine that. Going to the console and checking Tools --> Running Actions gives me nothing.
What I would like from Ipswitch support or other forum members is any theories as to what is causing the issues I am seeing. In its current state, WUG is not accurate enough to be usable. I would also like some insight into where I can look for bottlenecks, and how I can troubleshoot the issues so that WUG is back in a usable state.
Is there a way to check which actions are running and what is queued up?
I've got a ticket still open with WUG to get a tier 2/3 technician to call me back - but I have yet to hear from them. WUG Ticket number 905-82-337
Any help would be greatly appreciated from Ipswitch and knowledgeable forum members who may have experienced the same things.
Cheers,
Graeme Foote
*edit - looks like my spelling this morning was awesome. FALSE not FASLE ;)
Edited: 12/30/2008 5:30:56 PM by GraemeFoote |
| | | | Forum Member
       
Group: Forum Members Last Login: 1/8/2009 3:54:29 PM Posts: 44, Visits: 53 |
| | Hi, I'm a long-time user of WhatsUp, but I only visit the forums when I'm having an issue or trying something new.
I was running v12.0.1 (?) on Windows 2003 on an oldish HP 1RU server, monitoring around 170 devices (a mixture of Windows, Cisco, WAN). All was going fine until a disk in the RAID decided to die. The server was out of warranty, so I decided to move to a Dell 2850 that I had lying around. I built it as 2008 x64, then found WhatsUp is not supported on that, so I rebuilt it as 2008 x32.
I installed a fresh copy of WhatsUp (local DB), updated the old server to the latest WhatsUp, backed up the database, deactivated the license, activated the license on the new server, and imported the database. I swapped the IP addresses (for SysLog, traps, etc.) and added my modem for SMS alerts.
It seemed to be struggling. I installed SQL Manager and adjusted maximum memory to 1GB, as SQL has a habit of grabbing all it can and leaving nothing for anything else. It then seemed to be working fine. Well, almost: I'm getting a few false positives on disk space.
I noticed someone has posted about a script that checks all disks on a server. I didn't know about that. I use the "Exchange Monitor" and have an active monitor for each disk I want to check (i.e. "Drive C < 3GB", "Drive D < 3GB", etc.). I have about 40 Windows servers that I poll every 120s. Typical active monitors are ping, SNMP ping, Drive C < 3GB and Drive D < 3GB, and for just a few servers I monitor CPU, disk and memory.
Every now and again, usually on just one server (but the server changes), just one of the drives comes back as being low on space. On the next poll it's back to normal (and there is no way it actually went low on space).
So between the setup that was working and the current one with false positives, I changed from Windows 2003 to 2008 and upgraded from 12.0.1 to 12.3.1. |
| | | | Forum Newbie
       
Group: Forum Members Last Login: 6/8/2009 5:05:27 PM Posts: 8, Visits: 200 |
| | | | | Forum Member
       
Group: Forum Members Last Login: 1/8/2009 3:54:29 PM Posts: 44, Visits: 53 |
| | I'm still getting false positives on disk space, approximately one per day. Today it was low disk space on a C: drive where the Disk Utilization report shows it is 30GB in size and has stayed at 73% usage all day. |
| | | | Forum Guru
       
Group: Forum Members Last Login: 6/26/2009 8:28:52 AM Posts: 65, Visits: 601 |
| GraemeFoote (12/30/2008) Please check out the picture above to get an idea of what I am referring to. The check shows as red, down for 20 minutes but the duration is 1m.
FYI, the duration is the amount of time the device has been in that state, so if the monitor is "Down 20 minutes" and the duration is 1 minute, then the device has actually been "down" for 21 minutes. Also, a timeout is treated as a negative result, which I don't like. So if your disk space monitor shows as being "down", that could mean that your disk space is below the threshold you set, or that the monitor timed out. |
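To illustrate the ambiguity described in this post, here is a minimal Python sketch of a check that keeps "no answer" separate from "bad answer". It is not WhatsUp Gold code: `CheckResult`, `classify_check`, the fake reader functions and the 3 GB threshold are all hypothetical names and values, and the sketch assumes a custom check where you control how the result is interpreted.

```python
from enum import Enum
from typing import Callable

THREE_GB = 3 * 1024**3  # hypothetical threshold, mirroring a "Drive C < 3GB" monitor


class CheckResult(Enum):
    OK = "ok"
    BELOW_THRESHOLD = "below threshold"
    TIMED_OUT = "timed out"


def classify_check(read_free_bytes: Callable[[], int], threshold_bytes: int) -> CheckResult:
    """Run one poll and report a timeout separately from a real threshold breach."""
    try:
        free = read_free_bytes()
    except TimeoutError:
        # The target never answered: we know nothing about the disk itself.
        return CheckResult.TIMED_OUT
    if free >= threshold_bytes:
        return CheckResult.OK
    return CheckResult.BELOW_THRESHOLD


def healthy_reader() -> int:
    return 8 * 1024**3  # stand-in for a real query: 8 GB free


def slow_reader() -> int:
    raise TimeoutError("query did not answer in time")  # stand-in for a slow poll


print(classify_check(healthy_reader, THREE_GB).value)
print(classify_check(slow_reader, THREE_GB).value)
```

With results split this way, an alert on `TIMED_OUT` could be routed or retried differently from an alert on `BELOW_THRESHOLD`, which is the distinction the built-in monitor does not make according to this post.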
| | | | Forum Newbie
       
Group: Forum Members Last Login: 6/8/2009 5:05:27 PM Posts: 8, Visits: 200 |
| Right - but check out this screenshot also:
Down for 20 minutes, but duration is 12hrs. Incidentally - the VPN tunnel is in fact *not* down.
|
| | | | 
Time Traveler
       
Group: WhatsUp Gold Expert Last Login: Yesterday @ 10:09:57 AM Posts: 1,593, Visits: 7,686 |
| Your screenshot shows that the result of the poll was 0, and that it's not the value you expect. Did you dig into this a bit more? What is the value you normally expect?
Reading, writing and arithmetic - If you need to choose, please take option 1. |
| | | | Forum Newbie
       
Group: Forum Members Last Login: 6/8/2009 5:05:27 PM Posts: 8, Visits: 200 |
| The polled value is not the problem - the device went down at 8:32pm - duration 1m. By 8:51pm it had been 'down' for 12 hours.
|
| | | | 
Time Traveler
       
Group: WhatsUp Gold Expert Last Login: Yesterday @ 10:09:57 AM Posts: 1,593, Visits: 7,686 |
| | Sorry, I thought it was the issue because you said "Incidentally - the VPN tunnel is in fact *not* down.". Now I understand, and what you see is perfectly logical (at least in the WUG way). In WUG, you have pre-built device states: Down, Down at least 5 min, Down at least 20 min, and so on. If you wish, you can create additional states, something like "Down at least 60 min" for instance. In your current situation, the longest device state you have is "Down at least 20 min". So WUG says exactly that: the device has been down for AT LEAST 20 minutes, that is the device state, and the duration tells you that it's been 12h43m since the device entered that state. So the "AT LEAST" is important here, and I hope you now see better what it means !
|
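The "at least" behaviour explained in this post can be modelled roughly as follows. This is a sketch of the logic as described in the thread, not WhatsUp Gold's actual implementation; the function name `describe_down_state` and the threshold list (0, 5 and 20 minutes) are assumptions for illustration.

```python
from datetime import timedelta

# Assumed pre-built "down" states, expressed as minimum minutes spent down.
# The list must contain 0 so that a freshly down device always matches a state.
DOWN_STATE_MINUTES = [0, 5, 20]


def describe_down_state(total_down: timedelta, thresholds=DOWN_STATE_MINUTES):
    """Return (state label, time spent in that state) for a given total downtime."""
    reached = [m for m in thresholds if timedelta(minutes=m) <= total_down]
    highest = max(reached)  # the longest state whose minimum has been reached
    label = "Down" if highest == 0 else f"Down at least {highest} min"
    return label, total_down - timedelta(minutes=highest)


# A device that has been down for 13h03m in total:
label, duration = describe_down_state(timedelta(hours=13, minutes=3))
print(label, duration)  # Down at least 20 min 12:43:00
```

Under this model the state label stops changing once the longest configured state is reached, and only the duration keeps growing. Adding a hypothetical 60-minute entry to the list would produce a "Down at least 60 min" label instead.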
| | | | 
Time Traveler
       
Group: WhatsUp Gold Expert Last Login: Yesterday @ 10:09:57 AM Posts: 1,593, Visits: 7,686 |
| | Ok, SORRY, forget what I said. I re-read your post and what I said is still not your problem. But now I imagine the following, and it has to do with the polling frequency defined for your monitor. I might be wrong, but it's worth checking.
Let's say that you scheduled the active monitor to poll every 30 minutes. At 10am, the monitor was up. At 10:30am, WUG polls again, and the monitor is down. Because the polling interval is longer than your longest "down" state, WUG reports the longest down state it knows. That makes sense in a way: if WUG polls only every 30 minutes, it cannot honestly say whether the monitor went down 1 minute, 10 minutes or 18 minutes ago.
Now, for the false positives. Most of the time they happen when:
a/ You are polling devices over a slow WAN link. In that case, simply raise the timeout and retries for your monitor. For instance, I define a "Slow ping" monitor and use it instead of the normal ping when I poll distant devices. This helps if you have latency on your link, where a ping can easily be missed.
b/ You are using a script that takes too long to execute. The default timeout for a script is 10 seconds; you can raise it to 60 seconds, but that's not really recommended. The better way is to optimise your script (for instance, in a WMI query, refrain from using select * when you only need to retrieve one value).
c/ Your WUG server OR your target box is overloaded. Do some interactive monitoring on them just to check.
d/ Something else is in the way: a firewall having trouble with DCOM packets (all windoze networking needs that), your DNS servers having an issue, your domain controllers not authenticating quickly enough, or your time settings being out of sync (Kerberos is picky about that). In other words, since the whole polling process (especially all the WMI stuff) relies heavily on your whole Active Directory being in good shape, anything bad happening there will have an impact.
Hope it helps this time !
|
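As an illustration of point b/ in this post, the sketch below builds a WQL query that asks for a single property rather than using `select *`. Python is used only for illustration, and the helper name `build_wql` and the example file path are hypothetical. `CIM_DataFile` (with `Name` and `FileSize`) and `Win32_LogicalDisk` (with `DeviceID` and `FreeSpace`) are standard WMI classes, but how the query is passed to a monitor depends on your own setup.

```python
from typing import Sequence


def build_wql(wmi_class: str, properties: Sequence[str], where: str = "") -> str:
    """Build a WQL query that retrieves only the named properties."""
    if not properties:
        raise ValueError("name at least one property instead of falling back to SELECT *")
    query = f"SELECT {', '.join(properties)} FROM {wmi_class}"
    return f"{query} WHERE {where}" if where else query


# Backslashes inside WQL string literals must be escaped, hence the doubled ones.
file_path = r"C:\\data\\export.log"  # hypothetical path to the monitored file

print(build_wql("CIM_DataFile", ["FileSize"], f"Name = '{file_path}'"))
print(build_wql("Win32_LogicalDisk", ["FreeSpace"], "DeviceID = 'C:'"))
```

Filtering on the key property (`Name` for a file, `DeviceID` for a disk) also matters: without a `WHERE` clause the target would have to enumerate every instance of the class, which is the kind of slow query that can run into the script timeout mentioned above.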
| |
|
|