CONTEXT

There are two ways (at least that I've been able to determine) to poll the RAID health of MD1000 disk arrays attached to a PERC 6/E controller.

  1. Install the relevant parts of the Dell-provided OMSA management suite, start the OMSA daemons on the system, and then use the appropriate omreport commands to poll RAID health status.
  2. Use the MegaCli command-line tool provided by LSI to poll the information directly from the RAID controller card, without using OMSA at all.

I recently set up both of these routes on one system. The OMSA route was in place first, but it did not appear to run reliably or consistently after reboots, so the MegaCli route was set up afterwards as a second line of defense. Some folks may prefer to use only the second route, to reduce the clutter of installed software (OMSA is not exactly tiny).

For each of these methods, quick setup notes are provided below.


OMSA APPROACH:

First start the OMSA services:

  srvadmin-services.sh start

Then poll OMSA for a quick report on the health of the Perc6/MD1000 arrays:

[root@storage nov-4-08-perc-monitor-script]# omreport storage vdisk
List of Virtual Disks in the System

Controller PERC 6/E Adapter (Not Available)
ID                  : 0
Status              : Ok
Name                : vd0-array
State               : Ready
Progress            : Not Applicable
Layout              : RAID-6
Size                : 12,103.00 GB (12995497295872 bytes)
Device Name         : /dev/sdb
Type                : SATA
Read Policy         : Adaptive Read Ahead
Write Policy        : Write Back
Cache Policy        : Not Applicable
Stripe Element Size : 128 KB
Disk Cache Policy   : Enabled

ID                  : 1
Status              : Ok
Name                : vd1-array
State               : Ready
Progress            : Not Applicable
Layout              : RAID-6
Size                : 12,103.00 GB (12995497295872 bytes)
Device Name         : /dev/sdc
Type                : SATA
Read Policy         : Adaptive Read Ahead
Write Policy        : Write Back
Cache Policy        : Not Applicable
Stripe Element Size : 128 KB
Disk Cache Policy   : Enabled

ID                  : 2
Status              : Ok
Name                : vd2-array
State               : Ready
Progress            : Not Applicable
Layout              : RAID-6
Size                : 12,103.00 GB (12995497295872 bytes)
Device Name         : /dev/sdd
Type                : SATA
Read Policy         : Adaptive Read Ahead
Write Policy        : Write Back
Cache Policy        : Not Applicable
Stripe Element Size : 128 KB
Disk Cache Policy   : Enabled

ID                  : 3
Status              : Ok
Name                : vd3-array
State               : Ready
Progress            : Not Applicable
Layout              : RAID-6
Size                : 12,103.00 GB (12995497295872 bytes)
Device Name         : /dev/sde
Type                : SATA
Read Policy         : Adaptive Read Ahead
Write Policy        : Write Back
Cache Policy        : Not Applicable
Stripe Element Size : 128 KB
Disk Cache Policy   : Enabled

Controller SAS 6/iR Integrated (Not Available)
ID                  : 0
Status              : Ok
Name                : Virtual Disk 0
State               : Ready
Progress            : Not Applicable
Layout              : RAID-1
Size                : 465.25 GB (499558383616 bytes)
Device Name         : /dev/sda
Type                : SATA
Read Policy         : No Read Ahead
Write Policy        : Write Through
Cache Policy        : Not Applicable
Stripe Element Size : Not Applicable
Disk Cache Policy   :

[root@storage nov-4-08-perc-monitor-script]#

Note that in this example, again, we have a single PERC 6/E controller with 4 x direct-attached MD1000 enclosures, each holding 15 x 1 TB disks. The internal SAS 6/iR hardware RAID controller also reports the status of the OS disks (mirrored 500 GB SATA disks).
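The omreport output above also lends itself to a simple cron-style check. The sketch below is my own illustration, not part of the original setup: the helper function name and warning text are invented, and it assumes that a healthy vdisk always reports a Status line of "Ok" (as in the capture above).

```shell
# Sketch of a cron-callable check for the OMSA route (hypothetical helper;
# adjust names and wording to taste). check_vdisk_status reads the output
# of `omreport storage vdisk` on stdin and prints a warning if any Status
# line reports something other than "Ok".
check_vdisk_status() {
    if grep '^Status' | grep -v ': Ok' > /dev/null; then
        echo "Warning: Perc6/MD1000 vdisk status is no longer Ok"
    fi
}

# From cron, something like:
#   omreport storage vdisk | check_vdisk_status
# prints output only when a vdisk is unhealthy, so cron mails only on trouble.
```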


LSI MegaRaid CLI APPROACH:

Work capture / samples are provided below.



Create a file called "analysis.awk" in the /opt/MegaRAID/MegaCli directory with the following contents:

# This is a little AWK program that interprets MegaCLI output

    /Device Id/ { counter += 1; device[counter] = $3 }
    /Firmware state/ { state_drive[counter] = $3 }
    /Inquiry/ { name_drive[counter] = $3 " " $4 " " $5 " " $6 }
    END {
    for (i=1; i<=counter; i+=1) printf ( "Device %02d (%s) status is: %s <br/>\n", device[i], name_drive[i], state_drive[i]); }
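Before pointing the script at a live controller, the awk logic can be sanity-checked offline. The input lines below are made-up samples shaped like `MegaCli64 -PDList -aALL` output (the "Device Id" / "Inquiry Data" / "Firmware state" lines the script matches on), with the program inlined so the check is self-contained:

```shell
# Offline sanity check of the analysis.awk logic against fabricated sample
# input; the device id and serial below are illustrative, not a real capture.
printf '%s\n' \
    'Device Id: 3' \
    'Inquiry Data: ATA Hitachi HUA72101A74A GTF000PBGB6DAF' \
    'Firmware state: Online' \
| awk '
    /Device Id/ { counter += 1; device[counter] = $3 }
    /Firmware state/ { state_drive[counter] = $3 }
    /Inquiry/ { name_drive[counter] = $3 " " $4 " " $5 " " $6 }
    END {
    for (i=1; i<=counter; i+=1) printf ( "Device %02d (%s) status is: %s <br/>\n", device[i], name_drive[i], state_drive[i]); }'
# Prints: Device 03 (ATA Hitachi HUA72101A74A GTF000PBGB6DAF) status is: Online <br/>
```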

Now try it out:


[root@storage bin]# cd /opt/MegaRAID/MegaCli/

[root@storage MegaCli]# ls -la
total 2244
drwxr-xr-x 2 root root    4096 Nov 13 08:50 .
drwxr-xr-x 3 root root    4096 Nov 13 08:50 ..
-rw-r--r-- 1 root root     371 Nov 13 08:50 analysis.awk
-rwxr-xr-x 1 root root 2203560 Aug 18 03:58 MegaCli64
-rw-r--r-- 1 root root   69714 Nov 13 08:51 MegaSAS.log


[root@storage MegaCli]# more analysis.awk

# This is a little AWK program that interprets MegaCLI output

    /Device Id/ { counter += 1; device[counter] = $3 }
    /Firmware state/ { state_drive[counter] = $3 }
    /Inquiry/ { name_drive[counter] = $3 " " $4 " " $5 " " $6 }
    END {
    for (i=1; i<=counter; i+=1) printf ( "Device %02d (%s) status is: %s <br/>\n", device[i], name_drive[i], state_drive[i]); }



[root@storage MegaCli]# ./MegaCli64 -PDList -aALL | awk -f analysis.awk

Device 17 (ATA Hitachi HUA72101A74A GTF000PBGBK50F) status is: Online <br/>
Device 16 (ATA Hitachi HUA72101A74A GTF000PBGBK7YF) status is: Online <br/>
Device 15 (ATA Hitachi HUA72101A74A GTF000PBGBK77F) status is: Online <br/>
Device 14 (ATA Hitachi HUA72101A74A GTF000PBGBK73F) status is: Online <br/>
Device 13 (ATA Hitachi HUA72101A74A GTF000PBG59D7F) status is: Online <br/>
Device 12 (ATA Hitachi HUA72101A74A GTF000PBGBNEHF) status is: Online <br/>
Device 11 (ATA Hitachi HUA72101A74A GTF000PBGB6J7F) status is: Online <br/>
Device 10 (ATA Hitachi HUA72101A74A GTF000PBGBB76F) status is: Online <br/>
Device 09 (ATA Hitachi HUA72101A74A GTF000PBGB71XF) status is: Online <br/>
Device 08 (ATA Hitachi HUA72101A74A GTF000PBGBMV3F) status is: Online <br/>
Device 07 (ATA Hitachi HUA72101A74A GTF000PBGBKERF) status is: Online <br/>
Device 06 (ATA Hitachi HUA72101A74A GTF000PBGBK7DF) status is: Online <br/>
Device 05 (ATA Hitachi HUA72101A74A GTF000PBGBK6TF) status is: Online <br/>
Device 04 (ATA Hitachi HUA72101A74A GTF000PBGBK6NF) status is: Online <br/>
Device 03 (ATA Hitachi HUA72101A74A GTF000PBGB6DAF) status is: Online <br/>
Device 33 (ATA Hitachi HUA72101A74A GTF000PBGBK74F) status is: Online <br/>
Device 32 (ATA Hitachi HUA72101A74A GTF000PBGBK52F) status is: Online <br/>
Device 31 (ATA Hitachi HUA72101A74A GTF000PBGBT67F) status is: Online <br/>
Device 30 (ATA Hitachi HUA72101A74A GTF000PBGBRWMF) status is: Online <br/>
Device 29 (ATA Hitachi HUA72101A74A GTF000PBGBRBHF) status is: Online <br/>
Device 28 (ATA Hitachi HUA72101A74A GTF000PBGBA53F) status is: Online <br/>
Device 27 (ATA Hitachi HUA72101A74A GTF000PBGBNEDF) status is: Online <br/>
Device 26 (ATA Hitachi HUA72101A74A GTF000PBGBNDTF) status is: Online <br/>
Device 25 (ATA Hitachi HUA72101A74A GTF000PBGBKE6F) status is: Online <br/>
Device 24 (ATA Hitachi HUA72101A74A GTF000PBGBNDEF) status is: Online <br/>
Device 23 (ATA Hitachi HUA72101A74A GTF000PBGB3KTF) status is: Online <br/>
Device 22 (ATA Hitachi HUA72101A74A GTF000PBGBM8NF) status is: Online <br/>
Device 21 (ATA Hitachi HUA72101A74A GTF000PBGBS3GF) status is: Online <br/>
Device 20 (ATA Hitachi HUA72101A74A GTF000PBGBM7HF) status is: Online <br/>
Device 19 (ATA Hitachi HUA72101A74A GTF000PBGBT5SF) status is: Online <br/>
Device 48 (ATA Hitachi HUA72101A74A GTF000PBGAGPUF) status is: Online <br/>
Device 47 (ATA Hitachi HUA72101A74A GTF000PBGB4R5F) status is: Online <br/>
Device 46 (ATA Hitachi HUA72101A74A GTF000PBGAR0HF) status is: Online <br/>
Device 45 (ATA Hitachi HUA72101A74A GTF000PBGB1XZF) status is: Online <br/>
Device 44 (ATA Hitachi HUA72101A74A GTF000PBGB6HTF) status is: Online <br/>
Device 43 (ATA Hitachi HUA72101A74A GTF000PBGB6JHF) status is: Online <br/>
Device 42 (ATA Hitachi HUA72101A74A GTF000PBGB3H7F) status is: Online <br/>
Device 41 (ATA Hitachi HUA72101A74A GTF000PBGAYDPF) status is: Online <br/>
Device 40 (ATA Hitachi HUA72101A74A GTF000PBGAPSBF) status is: Online <br/>
Device 39 (ATA Hitachi HUA72101A74A GTF000PBGB6JSF) status is: Online <br/>
Device 38 (ATA Hitachi HUA72101A74A GTF000PBGB2WWF) status is: Online <br/>
Device 37 (ATA Hitachi HUA72101A74A GTF000PBGAGR8F) status is: Online <br/>
Device 36 (ATA Hitachi HUA72101A74A GTF000PBGAGS7F) status is: Online <br/>
Device 35 (ATA Hitachi HUA72101A74A GTF000PBGB4JRF) status is: Online <br/>
Device 34 (ATA Hitachi HUA72101A74A GTF000PBGB7KGF) status is: Online <br/>
Device 63 (ATA Hitachi HUA72101A74A GTF000PBGBNTMF) status is: Online <br/>
Device 62 (ATA Hitachi HUA72101A74A GTF000PBGB6B5F) status is: Online <br/>
Device 61 (ATA Hitachi HUA72101A74A GTF000PBGB6JDF) status is: Online <br/>
Device 60 (ATA Hitachi HUA72101A74A GTF000PBGBNN0F) status is: Online <br/>
Device 59 (ATA Hitachi HUA72101A74A GTF000PBGB6E8F) status is: Online <br/>
Device 58 (ATA Hitachi HUA72101A74A GTF000PBGB69UF) status is: Online <br/>
Device 57 (ATA Hitachi HUA72101A74A GTF000PBGB2GKF) status is: Online <br/>
Device 56 (ATA Hitachi HUA72101A74A GTF000PBGAW3KF) status is: Online <br/>
Device 55 (ATA Hitachi HUA72101A74A GTF000PBGB6AUF) status is: Online <br/>
Device 54 (ATA Hitachi HUA72101A74A GTF000PBGAW3RF) status is: Online <br/>
Device 53 (ATA Hitachi HUA72101A74A GTF000PBGB6HSF) status is: Online <br/>
Device 52 (ATA Hitachi HUA72101A74A GTF000PBGB690F) status is: Online <br/>
Device 51 (ATA Hitachi HUA72101A74A GTF000PBGBBAVF) status is: Online <br/>
Device 50 (ATA Hitachi HUA72101A74A GTF000PBGB69GF) status is: Online <br/>
Device 49 (ATA Hitachi HUA72101A74A GTF000PBGB69MF) status is: Online <br/>

[root@storage MegaCli]#


Note that in this example we have 4 x MD1000 enclosures direct-attached to a single PERC 6/E controller, with 15 x 1 TB SATA disks per enclosure.

A little cron-callable script to poll this and notify on problems might be as trivial as this:


[root@storage bin]# more megaraid-health-check-script
#
# Simple script to use the MegaCli utility from LSI - a horrid command-line thing,
# but possibly a somewhat more reliable way to poll the status of RAID devices
# than depending on the Dell OMSA daemons and whatnot.
#
# So, this is a 'second line of defense' to check raid health.
#
# This script setup as per hints derived from,
# http://www.bxtra.net/articles/2008-09-16/Dell-Perc6i-RAID-Monitoring-Script-using-MegaCli-LSI-CentOS-52-64-bits
#
# Note that the grep logic had to be tweaked to work with the output here,
# but it seems to do what we want, so that is 'good'.
#
# TDC nov-13-08
#
###########################
/opt/MegaRAID/MegaCli/MegaCli64 -PdList -aALL | awk -f /opt/MegaRAID/MegaCli/analysis.awk | grep -v "Online" > /dev/null && echo "Warning: Perc6-MD1000 RAID status no longer optimal"
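To wire this into cron, an entry along these lines could work; the schedule, script path, and MAILTO value below are illustrative assumptions, not taken from the original setup. It relies on cron's standard behavior of mailing any non-empty output of a job to MAILTO, and the script above only prints when something is wrong:

```crontab
MAILTO=root
# Check Perc6/MD1000 RAID health every 30 minutes; cron mails any output.
# (Path to the script is assumed - adjust to where it was actually saved.)
*/30 * * * * /root/bin/megaraid-health-check-script
```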