I keep getting sporadic warnings/false alerts on two Dell PowerEdge M1000e blade server chassis. The admin says there aren't any errors on the devices. Which OIDs are used for the chassis sensors below, AND which table(s) are the results stored in?
- Blade subsystem sensor
- Overall Chassis Status
- Overall Chassis Status as reported by Remote Access Card
- Redundancy sensor
--------------------------------
The hardware health results for the Blade subsystem are currently showing this (at the 1-minute reporting interval). (It's unusual that there haven't been any warnings; normally there are some, but at different times than the ones I'm getting from the UnDPs.)
I added UnDPs for OIDs 1.3.6.1.4.1.674.10892.2.3.1.1 through .9, but they don't agree with the warnings shown in Hardware Health. I'm polling them every 2 minutes, and below are the recent results that are not OK (that is, any status other than 3).
For device #10, these pollers are returning:
8/8/2014 7:38:09 AM
drsBladeCurrStatus | 386 | Dell DRAC | 4 | nonCritical | 4 | 4 | URL
drsGlobalCurrStatus | 386 | Dell DRAC | 4 | nonCritical | 4 | 4 | URL
drsGlobalSystemStatus | 386 | Dell DRAC | 4 | nonCritical | 4 | 4 | URL
8/8/2014 7:58:09 AM
drsBladeCurrStatus | 386 | Dell DRAC | 4 | nonCritical | 4 | 4 | URL
drsGlobalCurrStatus | 386 | Dell DRAC | 4 | nonCritical | 4 | 4 | URL
drsGlobalSystemStatus | 386 | Dell DRAC | 4 | nonCritical | 4 | 4 | URL
8/8/2014 8:06:09 AM
drsBladeCurrStatus | 386 | Dell DRAC | 4 | nonCritical | 4 | 4 | URL
drsGlobalCurrStatus | 386 | Dell DRAC | 4 | nonCritical | 4 | 4 | URL
drsGlobalSystemStatus | 386 | Dell DRAC | 4 | nonCritical | 4 | 4 | URL
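To make the raw values in these results easier to read, here is a minimal Python sketch that lists the nine polled OIDs and decodes a status integer. Values 3 (OK) and 4 (nonCritical) come from the results above; the other entries in the mapping are my understanding of the Dell status enumeration and should be treated as an assumption until checked against the MIB. The function names are hypothetical and are not part of Orion.

```python
# Base of the nine UnDP pollers described in this post.
BASE_OID = "1.3.6.1.4.1.674.10892.2.3.1"

# Status enumeration. 3 and 4 match the results shown in the post;
# the remaining values are an assumption to verify against the Dell MIB.
DELL_STATUS = {
    1: "other",
    2: "unknown",
    3: "ok",
    4: "nonCritical",
    5: "critical",
    6: "nonRecoverable",
}


def poller_oids(base=BASE_OID, first=1, last=9):
    """Return the OIDs base.first through base.last as strings."""
    return [f"{base}.{i}" for i in range(first, last + 1)]


def decode_status(raw):
    """Translate a raw status integer into its symbolic name."""
    return DELL_STATUS.get(int(raw), f"unrecognized({raw})")


def is_ok(raw):
    """True only when the raw status is 3 (OK)."""
    return int(raw) == 3


if __name__ == "__main__":
    for oid in poller_oids():
        print(oid)
    print(decode_status(4), is_ok(4))
```

Scalar SNMP objects are normally read with a trailing `.0` instance; whether the UnDP definition needs it depends on how the poller was created, which is not shown here.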
Node info:
SysObjectID | 1.3.6.1.4.1.674.10892.2 |
CMC v4.45
I started working through the troubleshooting section of the manual, but when I test the SNMP OID it lists for Dell:
For Dell: 1.3.6.1.4.1.674.10892.1.300.10.1.8.1
I get "OID is not supported". So how does hardware health work at all? I assume the manual is not correct. Please shed some light. Thank you!!
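For comparing the manual's test OID with this node's SysObjectID, a small sketch like the one below checks whether one OID falls under another OID's subtree. It only compares the numbers quoted in this post; it does not show which OIDs Hardware Health actually polls. `parse_oid` and `is_under` are hypothetical helper names.

```python
def parse_oid(oid):
    """Split a dotted OID string into a tuple of integers."""
    return tuple(int(part) for part in oid.strip(".").split("."))


def is_under(oid, subtree):
    """True when `oid` equals `subtree` or lies beneath it."""
    o, s = parse_oid(oid), parse_oid(subtree)
    return o[:len(s)] == s


# All three values are copied from this post.
NODE_SYSOBJECTID = "1.3.6.1.4.1.674.10892.2"
MANUAL_TEST_OID = "1.3.6.1.4.1.674.10892.1.300.10.1.8.1"
UNDP_BASE_OID = "1.3.6.1.4.1.674.10892.2.3.1"

if __name__ == "__main__":
    print(is_under(MANUAL_TEST_OID, NODE_SYSOBJECTID))  # False
    print(is_under(UNDP_BASE_OID, NODE_SYSOBJECTID))    # True
```

The manual's OID sits under `...10892.1`, while the node reports `...10892.2`, so the two are different branches. Whether that is the reason for the "OID is not supported" result is a guess, not something confirmed here.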
So far I have:
- Adjusted node polling from 3 to 2 min. (The alert triggers when the condition lasts > 4 min.)
- Adjusted the statistics collection to every 4 minutes (instead of 10) so it doesn't look like the warning was lasting 10 minutes. [It seemed to "average" the 5 statuses together, such as OK, OK, Warning, OK, OK, with a result of Warning, which was making it hard for me to troubleshoot what was really going on.]
- For a different issue, adjusted the SNMP polling to increase the retry count and the timeout: <SnmpProcessProbeSettings BulkSize="10" PollMethod="None" MaxProcessCount="30000" RetryCount="4" Timeout="3500"/>
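The "averaging" effect described in the list above can be illustrated with a worst-status rollup. This is only a sketch of one behavior that would produce the result I saw (OK, OK, Warning, OK, OK shown as Warning); I don't know how Orion actually aggregates statuses, and the names and severity ordering below are assumptions.

```python
# Assumed severity ordering, used only for this illustration.
SEVERITY = {"OK": 0, "Warning": 1, "Critical": 2}


def rollup_worst(statuses):
    """Return the most severe status in a list of polled statuses."""
    return max(statuses, key=lambda s: SEVERITY[s])


def rollup_windows(statuses, per_window):
    """Roll up consecutive polls in groups of `per_window` samples."""
    return [
        rollup_worst(statuses[i:i + per_window])
        for i in range(0, len(statuses), per_window)
    ]


if __name__ == "__main__":
    polls = ["OK", "OK", "Warning", "OK", "OK"]  # five 2-minute polls
    print(rollup_worst(polls))        # one 10-minute window
    print(rollup_windows(polls, 2))   # 4-minute windows
```

Under this assumption, one 10-minute window reports Warning for the whole period, while 4-minute windows narrow the Warning down to the window that contains the single bad poll.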
Server & Application Monitor Administrator Guide : Troubleshooting : Troubleshooting Hardware Health : Troubleshooting an SNMP Node
Orion Platform 2013.2.1, SAM 6.0.2, IPAM 4.1, NCM 7.2.2, NPM 10.6.1, NTA 3.11.0, IVIM 1.9.0