Assorted-Reference.Pe1950-linux-topics-os-disk-raid-health-monitoring History
- Note that the file /etc/rc.modules needs to have the execute bit set (i.e., chmod 700 for that file) in order for it to run at boot properly; otherwise this modprobe does not actually happen when the system is booted.
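For example, one quick way to set and verify this (assuming the file already exists with the modprobe line described further down):
chmod 700 /etc/rc.modules
ls -l /etc/rc.modules    # should now show -rwx------ so rc.sysinit will execute it at boot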
- Note: the DRIVES count in the script below greps for the drive model string that mpt-status reports (WDC for Western Digital drives, ST3500 for the Seagate drives shown in the capture further down); adjust the pattern to match your hardware. With 1 node and 2 x HDDs per unit, 2 drives in total are expected to report.
#!/bin/bash
# Small script called by cron nightly to poll for health
# of the local OS Mirror raid arrays, and notify if things are amiss.
# TDC Nov-13-08
#
HOSTNAME=`hostname`
mpt-status > /tmp/head-node-raid-health-check-temporary-file
# First confirm we have all nodes reporting back some kind of status, else throw an error.
# Note there is 1 node with 2 x WDC HDDs per unit, for 2 drives in total expected to report.
DRIVES=`cat /tmp/head-node-raid-health-check-temporary-file | grep -c WDC`
# We also anticipate 1 count of "OPTIMAL" returned, one per raid set / one per system.
OPTIMAL=`cat /tmp/head-node-raid-health-check-temporary-file | grep -c OPTIMAL`
# Remove the temp file, so that it is not present for next time.
rm /tmp/head-node-raid-health-check-temporary-file
# Now, we do some logic to test that all is well in Denmark.
if [ "$OPTIMAL" == "1" ]; then
    EXIT_PAINLESSLY="true"
else
    echo "$HOSTNAME reports $DRIVES of 2 expected HDDs reporting on raid health, with $OPTIMAL of 1 raid sets reporting optimal health - PLEASE VERIFY IMMEDIATELY" | mail -s "Possible RAID Errors on $HOSTNAME" systems
fi
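As a sketch of how the script above might be scheduled, assuming it were saved as /usr/local/sbin/raid-health-check.sh (a hypothetical path and name), a nightly entry in root's crontab could look like:
# run the raid health check at 02:30 every night (add via "crontab -e" as root)
30 2 * * * /usr/local/sbin/raid-health-check.sh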
- Get yourself a copy of the RPM (or compile from source) for mpt-status. Available at http://www.drugphish.ch/~ratz/mpt-status/ or, for the RPM specifically, http://www.drugphish.ch/~ratz/mpt-status/RPMS/1.2.0_RC7/mpt-status-1.2.0_RC7-3.i386.rpm
- Install the RPM on your system: "rpm --install mpt-status-1.2.0_RC7-3.i386.rpm"
- Try calling it to see what happens. If it complains that you may need to "mknod", follow the suggestion. If it complains that the mptctl module is not loaded, then load the module.
See a capture below of an example of this sequence:
[root@box nov-13-08-mptsas-status]# ls -la
total 204
drwxr-xr-x  3 root root   4096 Nov 13 08:40 .
drwxr-xr-x  9 root root   4096 Nov 13 09:08 ..
drwxr-xr-x  6  501  501   4096 Nov 13 08:37 mpt-status-1.2.0
-rw-r--r--  1 root root  27986 Jun 30  2006 mpt-status-1.2.0_RC7-3.i386.rpm
-rw-r--r--  1 root root 153600 Nov  5  2006 mpt-status-1.2.0.tar
-rw-r--r--  1 root root     82 Nov 13 08:40 README.txt
-rw-r--r--  1 root root     65 Nov 13 08:33 src-url
[root@box nov-13-08-mptsas-status]# rpm --install mpt-status-1.2.0_RC7-3.i386.rpm
[root@box nov-13-08-mptsas-status]# mpt-status
open /dev/mptctl: No such file or directory
  Try: mknod /dev/mptctl c 10 220
  Make sure mptctl is loaded into the kernel
[root@box nov-13-08-mptsas-status]# mknod /dev/mptctl c 10 220
[root@box nov-13-08-mptsas-status]# mpt-status
open /dev/mptctl: No such device
  Are you sure your controller is supported by mptlinux?
  Make sure mptctl is loaded into the kernel
[root@box nov-13-08-mptsas-status]# modprobe mptctl
[root@box nov-13-08-mptsas-status]# mpt-status
ioc0 vol_id 0 type IM, 2 phy, 465 GB, state OPTIMAL, flags ENABLED
ioc0 phy 1 scsi_id 9 ATA ST3500320NS MA07, 465 GB, state ONLINE, flags NONE
ioc0 phy 0 scsi_id 1 ATA ST3500320NS MA07, 465 GB, state ONLINE, flags NONE
- Now that it works, ensure that the module is loaded each time your system reboots by adding a line to /etc/rc.modules reading "modprobe mptctl".
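A minimal sketch of what that file might contain, assuming it does not already exist on your system:
#!/bin/sh
# /etc/rc.modules - executed at boot by rc.sysinit when present and executable (hence the chmod note above)
modprobe mptctl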
- Provided above is a trivial script you could call from cron at regular intervals (nightly?) to poll the health of the disks and notify you if anything needs attention.
The PE1950 server has integrated LSI RAID hardware, and the regular MPT modules present in RHEL 4.X/5.X will recognize this hardware, so installation is trivial (a non-issue).
However, actually monitoring the health of the RAID array isn't instantly trivial. While the Dell-provided tools under Linux (the OMSA suite) do poll the integrated RAID card for health, in some circumstances you may not want to install OMSA just to monitor internal RAID health. (In my case, I didn't want to install OMSA on all 25 compute nodes in a compute cluster just to poll the health of the OS hardware RAID mirrors.)
Some digging with Google located a third-party open source tool, "mpt-status", which makes use of the mptctl kernel module to generate an easily human-readable report on the status of the internal RAID.
This can subsequently be used in a basic script, called by cron (say, every night), that notifies you of any non-optimal circumstances.
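For a quick manual illustration of the idea (assuming mpt-status is installed and mptctl is loaded, as described in the steps above), counting the OPTIMAL lines in its output is enough to flag a degraded mirror:
mpt-status | grep -c OPTIMAL    # expect 1 per healthy raid set; anything else warrants a look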