Assorted-Reference.Pe1950-linux-topics-os-disk-raid-health-monitoring History


November 16, 2009, at 02:04 PM by 142.177.26.251 -
Changed lines 54-55 from:
to:
  • Note that the file /etc/rc.modules needs to have its execute bit set (e.g., chmod 700 on that file) in order for it to run at boot; otherwise the modprobe does not actually happen when the system is booted.
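
For reference, a minimal sketch of that setup (just one reasonable way to do what this note and the /etc/rc.modules instruction further down this page describe):

echo "modprobe mptctl" >> /etc/rc.modules    # the line described further down this page
chmod 700 /etc/rc.modules                    # execute bit, so the file actually runs at boot
ls -l /etc/rc.modules                        # quick sanity check of the permissions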
November 17, 2008, at 07:51 PM by 142.68.237.135 -
Changed lines 3-4 from:

However, actually monitoring the health of the RAID array isn't trivial, and the Dell-provided tools under Linux (the OMSA suite) don't appear to poll the integrated RAID card for health; only add-in cards like the Perc6e seemed to generate the desired output.

to:

However, actually monitoring the health of the RAID array isn't trivial. While the Dell-provided tools under Linux (the OMSA suite) do poll the integrated RAID card for health, in some circumstances you may not want to install OMSA just to monitor internal RAID health. (In my case, I didn't want to install OMSA on all 25 compute nodes in a compute cluster just to poll the health of the OS hardware RAID mirrors.)

November 17, 2008, at 07:34 PM by 142.68.237.135 -
Changed lines 72-73 from:
# Note there is 1 node with 2 x WDC HDDs per unit, for 2 drives in total expected to report.

DRIVES=`cat /tmp/head-node-raid-health-check-temporary-file | grep -c WDC`

to:
# Note there is 1 node with 2 x ST3500 HDDs per unit, for 2 drives in total expected to report.

DRIVES=`cat /tmp/head-node-raid-health-check-temporary-file | grep -c ST3500`

November 17, 2008, at 07:33 PM by 142.68.237.135 -
Added lines 59-88:
#!/bin/bash
# Small script called by cron nightly to poll for health
# of the local OS mirror raid arrays, and notify if things are amiss.
# TDC Nov-13-08

HOSTNAME=`hostname`
mpt-status > /tmp/head-node-raid-health-check-temporary-file

# First confirm we have all nodes reporting back some kind of status, else throw error.
# Note there is 1 node with 2 x WDC HDDs per unit, for 2 drives in total expected to report.
DRIVES=`cat /tmp/head-node-raid-health-check-temporary-file | grep -c WDC`

# We also anticipate 1 count of "OPTIMAL" returned, one per raid set / one per system.
OPTIMAL=`cat /tmp/head-node-raid-health-check-temporary-file | grep -c OPTIMAL`

# Remove the temp file, so that it is not present for next time.
rm /tmp/head-node-raid-health-check-temporary-file

# Now, we do some logic to test that all is well in Denmark.
if [ "$OPTIMAL" == "1" ]; then
    EXIT_PAINLESSLY="true"
else
    echo "$HOSTNAME reports $DRIVES of 2 expected HDDs reporting on raid health, with $OPTIMAL of 1 raid sets reporting optimal health - PLEASE VERIFY IMMEDIATELY" | mail -s "Possible RAID Errors on $HOSTNAME" systems
fi
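
One way to hook the script in from cron (a sketch only: the path /root/check-raid-mirror-health.sh is just an assumed location for wherever you save the script, and /etc/crontab is only one of several places the entry could live):

# /etc/crontab format: minute hour day-of-month month day-of-week user command
# run the raid health check at 02:30 every night; script path is hypothetical
30 2 * * * root /root/check-raid-mirror-health.sh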

November 17, 2008, at 07:31 PM by 142.68.237.135 -
Added lines 10-59:
  • Get yourself a copy of mpt-status, either the RPM or compiled from source. It is available at http://www.drugphish.ch/~ratz/mpt-status/ or, for the RPM specifically, http://www.drugphish.ch/~ratz/mpt-status/RPMS/1.2.0_RC7/mpt-status-1.2.0_RC7-3.i386.rpm
  • Install the RPM on your system: "rpm --install mpt-status-1.2.0_RC7-3.i386.rpm"
  • Try calling it to see what happens. If it complains, you may need to "mknod" - follow the suggestion. If it complains that the mptctl module is not loaded, then load the module.

See the capture below for an example of this sequence:

[root@box nov-13-08-mptsas-status]# ls -la
total 204
drwxr-xr-x  3 root root   4096 Nov 13 08:40 .
drwxr-xr-x  9 root root   4096 Nov 13 09:08 ..
drwxr-xr-x  6  501  501   4096 Nov 13 08:37 mpt-status-1.2.0
-rw-r--r--  1 root root  27986 Jun 30  2006 mpt-status-1.2.0_RC7-3.i386.rpm
-rw-r--r--  1 root root 153600 Nov  5  2006 mpt-status-1.2.0.tar
-rw-r--r--  1 root root     82 Nov 13 08:40 README.txt
-rw-r--r--  1 root root     65 Nov 13 08:33 src-url

[root@box nov-13-08-mptsas-status]# rpm --install mpt-status-1.2.0_RC7-3.i386.rpm

[root@box nov-13-08-mptsas-status]# mpt-status
open /dev/mptctl: No such file or directory
  Try: mknod /dev/mptctl c 10 220

Make sure mptctl is loaded into the kernel

[root@box nov-13-08-mptsas-status]# mknod /dev/mptctl c 10 220

[root@box nov-13-08-mptsas-status]# mpt-status
open /dev/mptctl: No such device
  Are you sure your controller is supported by mptlinux?
Make sure mptctl is loaded into the kernel

[root@box nov-13-08-mptsas-status]# modprobe mptctl

[root@box nov-13-08-mptsas-status]# mpt-status
ioc0 vol_id 0 type IM, 2 phy, 465 GB, state OPTIMAL, flags ENABLED
ioc0 phy 1 scsi_id 9 ATA      ST3500320NS      MA07, 465 GB, state ONLINE, flags NONE
ioc0 phy 0 scsi_id 1 ATA      ST3500320NS      MA07, 465 GB, state ONLINE, flags NONE

  • Now that it works, ensure that the module is loaded each time your system reboots by adding a line to /etc/rc.modules reading "modprobe mptctl"
  • Provided below is a trivial script you could call from cron at regular intervals (nightly, say) to poll the health of the disks and notify you in case anything needs attention.

November 17, 2008, at 07:24 PM by 142.68.237.135 -
Added lines 1-9:

The PE1950 server has integrated LSI hardware, and the regular MPT modules present in RHEL 4.x/5.x will recognize it, so installation is trivial (a non-issue).
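
A quick way to check that the MPT driver modules did get picked up is something like the following (a sketch; exact module names can vary a little by kernel version):

lsmod | grep -i mpt     # should list the Fusion MPT modules (e.g. mptsas/mptbase, plus mptctl once loaded)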

However, actually monitoring the health of the RAID array isn't trivial, and the Dell-provided tools under Linux (the OMSA suite) don't appear to poll the integrated RAID card for health; only add-in cards like the Perc6e seemed to generate the desired output.

Some digging with Google located a third-party open-source tool, "mpt-status", which makes use of the mptctl kernel module to generate an easily human-readable report on the status of the internal RAID.

This can then be wrapped in a basic script, called by cron (say, every night), that notifies you of any non-optimal circumstances.