Assorted-Reference.Pe1950-linux-topics-os-disk-raid-health-monitoring History


November 16, 2009, at 02:04 PM by 142.177.26.251 -
Changed lines 54-55 from:
to:
  • Note that the file /etc/rc.modules needs to have its execute bit set (e.g., chmod 700 on that file) in order for it to run at boot; otherwise the modprobe does not actually happen when the system is booted.
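
For reference, a minimal sketch of that setup (just one reasonable way to do what this note and the /etc/rc.modules instruction further down this page describe):

echo "modprobe mptctl" >> /etc/rc.modules    # the line described further down this page
chmod 700 /etc/rc.modules                    # execute bit, so the file actually runs at boot
ls -l /etc/rc.modules                        # quick sanity check of the permissions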
November 17, 2008, at 07:51 PM by 142.68.237.135 -
Changed lines 3-4 from:

However, actually monitoring the health of the RAID array isn't trivial, and the Dell-provided tools under Linux (the OMSA suite) don't appear to poll the integrated RAID card for health; only add-in cards like the Perc6e seemed to generate the desired output.

to:

However, actually monitoring the health of the RAID array isn't trivial. While the Dell-provided tools under Linux (the OMSA suite) do poll the integrated RAID card for health, in some circumstances you may not want to install OMSA just to monitor internal RAID health. (In my case, I didn't want to install OMSA on all 25 compute nodes in a compute cluster just to poll the health of the OS hardware RAID mirrors.)

November 17, 2008, at 07:34 PM by 142.68.237.135 -
Changed lines 72-73 from:
# Note there is 1 node with 2 x WDC HDDs per unit, for 2 drives in total expected to report.

DRIVES=`cat /tmp/head-node-raid-health-check-temporary-file | grep -c WDC`

to:
# Note there is 1 node with 2 x ST3500 HDDs per unit, for 2 drives in total expected to report.

DRIVES=`cat /tmp/head-node-raid-health-check-temporary-file | grep -c ST3500`

November 17, 2008, at 07:33 PM by 142.68.237.135 -
Added lines 59-88:
#!/bin/bash
# Small script called by cron nightly to poll for health
# of the local OS mirror raid arrays, and notify if things are amiss.
# TDC Nov-13-08

HOSTNAME=`hostname`
mpt-status > /tmp/head-node-raid-health-check-temporary-file

# First confirm we have all nodes reporting back some kind of status, else throw error.
# Note there is 1 node with 2 x WDC HDDs per unit, for 2 drives in total expected to report.
DRIVES=`cat /tmp/head-node-raid-health-check-temporary-file | grep -c WDC`

# We also anticipate 1 count of "OPTIMAL" returned, one per raid set / one per system.
OPTIMAL=`cat /tmp/head-node-raid-health-check-temporary-file | grep -c OPTIMAL`

# Remove the temp file, so that it is not present for next time.
rm /tmp/head-node-raid-health-check-temporary-file

# Now, we do some logic to test that all is well in Denmark.
if [ "$OPTIMAL" == "1" ]; then
    EXIT_PAINLESSLY="true"
else
    echo "$HOSTNAME reports $DRIVES of 2 expected HDDs reporting on raid health, with $OPTIMAL of 1 raid sets reporting optimal health - PLEASE VERIFY IMMEDIATELY" | mail -s "Possible RAID Errors on $HOSTNAME" systems
fi
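
One way to hook the script in from cron (a sketch only: the path /root/check-raid-mirror-health.sh is just an assumed location for wherever you save the script, and /etc/crontab is only one of several places the entry could live):

# /etc/crontab format: minute hour day-of-month month day-of-week user command
# run the raid health check at 02:30 every night; script path is hypothetical
30 2 * * * root /root/check-raid-mirror-health.sh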

November 17, 2008, at 07:31 PM by 142.68.237.135 -
Added lines 10-59:
  • Get yourself a copy of mpt-status, either the RPM or compiled from source. It is available at http://www.drugphish.ch/~ratz/mpt-status/ or, for the RPM specifically, http://www.drugphish.ch/~ratz/mpt-status/RPMS/1.2.0_RC7/mpt-status-1.2.0_RC7-3.i386.rpm
  • Install the RPM on your system: "rpm --install mpt-status-1.2.0_RC7-3.i386.rpm"
  • Try calling it to see what happens. If it complains, you may need to "mknod" - follow the suggestion. If it complains that the mptctl module is not loaded, then load the module.

See the capture below for an example of this sequence:

[root@box nov-13-08-mptsas-status]# ls -la
total 204
drwxr-xr-x  3 root root   4096 Nov 13 08:40 .
drwxr-xr-x  9 root root   4096 Nov 13 09:08 ..
drwxr-xr-x  6  501  501   4096 Nov 13 08:37 mpt-status-1.2.0
-rw-r--r--  1 root root  27986 Jun 30  2006 mpt-status-1.2.0_RC7-3.i386.rpm
-rw-r--r--  1 root root 153600 Nov  5  2006 mpt-status-1.2.0.tar
-rw-r--r--  1 root root     82 Nov 13 08:40 README.txt
-rw-r--r--  1 root root     65 Nov 13 08:33 src-url

[root@box nov-13-08-mptsas-status]# rpm --install mpt-status-1.2.0_RC7-3.i386.rpm

[root@box nov-13-08-mptsas-status]# mpt-status
open /dev/mptctl: No such file or directory
  Try: mknod /dev/mptctl c 10 220

Make sure mptctl is loaded into the kernel

[root@box nov-13-08-mptsas-status]# mknod /dev/mptctl c 10 220

[root@box nov-13-08-mptsas-status]# mpt-status
open /dev/mptctl: No such device
  Are you sure your controller is supported by mptlinux?
Make sure mptctl is loaded into the kernel

[root@box nov-13-08-mptsas-status]# modprobe mptctl

[root@box nov-13-08-mptsas-status]# mpt-status
ioc0 vol_id 0 type IM, 2 phy, 465 GB, state OPTIMAL, flags ENABLED
ioc0 phy 1 scsi_id 9 ATA      ST3500320NS      MA07, 465 GB, state ONLINE, flags NONE
ioc0 phy 0 scsi_id 1 ATA      ST3500320NS      MA07, 465 GB, state ONLINE, flags NONE

  • Now that it works, ensure that the module is loaded each time your system reboots by adding a line to /etc/rc.modules reading "modprobe mptctl"
  • Provided below is a trivial script you could call from cron at regular intervals (nightly, say) to poll the health of the disks and notify you in case anything needs attention.

November 17, 2008, at 07:24 PM by 142.68.237.135 -
Added lines 1-9:

The PE1950 server has integrated LSI hardware, and the regular MPT modules present in RHEL 4.x/5.x will recognize it, so installation is trivial (a non-issue).
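
A quick way to check that the MPT driver modules did get picked up is something like the following (a sketch; exact module names can vary a little by kernel version):

lsmod | grep -i mpt     # should list the Fusion MPT modules (e.g. mptsas/mptbase, plus mptctl once loaded)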

However, actually monitoring the health of the RAID array isn't trivial, and the Dell-provided tools under Linux (the OMSA suite) don't appear to poll the integrated RAID card for health; only add-in cards like the Perc6e seemed to generate the desired output.

Some digging with Google located a third-party open-source tool, "mpt-status", which makes use of the mptctl kernel module to generate an easily human-readable report on the status of the internal RAID.

This can then be wrapped in a basic script, called by cron (say, every night), that notifies you of any non-optimal circumstances.