OVH PROXMOX SOFTWARE RAID - RECOVERY HINTS
- Context: you have an OVH (So You Start, Kimsufi, etc.) rental server
- running Proxmox on Linux, installed via the template that OVH provides,
- with at least 2 hard drives in a vanilla software RAID config
- (the stock raid setup that the OVH template creates).
- One drive, in this case SDA, the first drive, was removed and replaced because it was failing/bad.
- Result: after the drive replacement is done, the system won't boot any longer. Either grub was never installed to SDB by the template, or the motherboard BIOS will not try to boot from any drive other than SDA. Can't really tell which.
- Solution hints are below
- Note: it is recommended to do 'soft reboots' from inside the host, and NOT by doing 'reboot' via the OVH admin panel.
- Note also: if the server was set to 'boot from net rescue mode', you will need to flip it back to boot from hard disk.
Concise hints
- System is booted in rescue mode
- clone disk layout from good SDB to new SDA drive
- add root volume SDA slice into root MD raid device, let it sync up
- optional? Use dd to clone SDB1 onto SDA1 (a tiny non-raid slice). This was done as a debug step; not sure if it is required.
- once synced, setup chroot environment
- enter chroot environment and install grub
- reboot server. Happy days, things boot up.
- start the raid sync for the last large PVE data slice and let it grind for a few hours as a background job.
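The hints above can be sketched as a small helper that only prints the command sequence for a given disk pair, so it can be reviewed before anything is typed for real. This is a rough sketch, not a tested recovery script: print_plan is a made-up name, and the md2/md4 array names and the 1/2/4 slice numbers are assumed to match the stock layout described in these notes.

```shell
# Print (never execute) the command sequence from these notes for a disk pair.
# Assumption: stock OVH layout, i.e. slice 1 = BIOS boot, slice 2 = md2 (root),
# slice 4 = md4 (PVE data). print_plan is a hypothetical helper name.
print_plan() {
    good="$1"   # surviving disk name, e.g. sdb
    new="$2"    # blank replacement disk name, e.g. sda
    if [ -z "$good" ] || [ -z "$new" ] || [ "$good" = "$new" ]; then
        echo "usage: print_plan GOOD_DISK NEW_DISK (two different disks)" >&2
        return 1
    fi
    cat <<EOF
sgdisk /dev/$good -R /dev/$new
sgdisk -G /dev/$new
mdadm --manage /dev/md2 -a /dev/${new}2
dd if=/dev/${good}1 of=/dev/${new}1
mount /dev/md2 /mnt
mount --rbind /dev /mnt/dev
mount --rbind /proc /mnt/proc
mount --rbind /sys /mnt/sys
chroot /mnt bash
grub-install /dev/$new
mdadm --manage /dev/md4 -a /dev/${new}4
EOF
}
```

Usage would be something like: print_plan sdb sda. The last line of the plan (md4) is meant to be run after the reboot, from the normal system rather than from rescue mode.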
Detailed work hints.
- Before starting, here is what we see, logged in as root over SSH in the rescue environment on the server.
- The old SDB drive has this disk layout, taken from a cfdisk output capture:
Disk: /dev/sdb
Size: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Label: gpt, identifier: CA0C7679-A38B-498A-93D1-CFF2BFCD4171
    Device           Start        End    Sectors    Size Type
>>  /dev/sdb1           40       2048       2009 1004.5K BIOS boot
    Free space        4096       4095          0      0B
    /dev/sdb2         4096   41945087   41940992     20G Linux RAID
    /dev/sdb3     41945088   44040191    2095104   1023M Linux swap
    /dev/sdb4     44040192 3907018751 3862978560    1.8T Linux RAID
    Free space  3907018752 3907029134      10383    5.1M
- RAID status: cat /proc/mdstat currently shows:
root@rescue:/etc/mdadm# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md2 : active raid1 sdb2[1]
20970432 blocks [2/1] [_U]
md4 : active raid1 sdb4[1]
1931489216 blocks [2/1] [_U]
bitmap: 0/15 pages [0KB], 65536KB chunk
- We are confident SDB is the good old disk and SDA is the new empty disk. You can check with cfdisk /dev/sda to validate if you wish.
- Clone the disk layout from SDB to SDA, then randomize the GUIDs on SDA so the two disks do not share identifiers. Thus:
sgdisk /dev/sdb -R /dev/sda
sgdisk -G /dev/sda
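Since getting the clone direction backwards would wipe the good disk, a pre-flight check before running sgdisk may be worth having. The sketch below is only an illustration: partition_count and check_clone_direction are made-up helper names, it assumes sdX-style device names, and it assumes the replacement disk arrives with no partitions at all (if it ships with a factory partition table the check will refuse and you decide by hand).

```shell
# Count the partitions the kernel currently sees on a disk (sdX-style names assumed).
# Reads /proc/partitions unless another file is given as the second argument.
partition_count() {
    awk -v d="$1" '$4 ~ ("^" d "[0-9]+$") { n++ } END { print n + 0 }' "${2:-/proc/partitions}"
}

# Refuse unless the target disk looks blank and the source disk has a layout to copy.
check_clone_direction() {
    src="$1"; dst="$2"; table="${3:-/proc/partitions}"
    if [ "$(partition_count "$dst" "$table")" -ne 0 ]; then
        echo "refusing: /dev/$dst already has partitions" >&2
        return 1
    fi
    if [ "$(partition_count "$src" "$table")" -eq 0 ]; then
        echo "refusing: /dev/$src has no partitions to copy" >&2
        return 1
    fi
    echo "ok: copy layout from /dev/$src to /dev/$dst"
}
```

Run before the sgdisk commands, for example: check_clone_direction sdb sda. It only reads /proc/partitions and changes nothing on disk.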
- Once that is done, we can add the root slice on SDA into the raid and let it sync up. Thus:
root@rescue:/etc/mdadm# mdadm --manage /dev/md2 -a /dev/sda2
mdadm: added /dev/sda2
HAVE A LOOK TO CONFIRM SYNC IS UNDERWAY:
root@rescue:/etc/mdadm# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md2 : active raid1 sda2[2] sdb2[1]
20970432 blocks [2/1] [_U]
[>....................] recovery = 2.4% (519488/20970432) finish=1.9min speed=173162K/sec
md4 : active raid1 sdb4[1]
1931489216 blocks [2/1] [_U]
bitmap: 0/15 pages [0KB], 65536KB chunk
unused devices: <none>
root@rescue:/etc/mdadm#
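- Let the sync finish; it takes a few minutes.
- OPTIONAL (?) STEP: manual dd clone of SDB1 onto SDA1. Probably not required, since grub-install on a GPT disk writes its core image into the BIOS boot partition itself. Thus: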
root@rescue:~# dd if=/dev/sdb1 of=/dev/sda1
2009+0 records in
2009+0 records out
1028608 bytes (1.0 MB) copied, 0.0316076 s, 32.5 MB/s
root@rescue:~#
- Confirm raid sync is done
root@rescue:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md2 : active raid1 sda2[0] sdb2[1]
20970432 blocks [2/2] [UU]
md4 : active raid1 sdb4[1]
1931489216 blocks [2/1] [_U]
bitmap: 0/15 pages [0KB], 65536KB chunk
unused devices: <none>
root@rescue:~#
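Rather than eyeballing /proc/mdstat, the check can be scripted. The following is a minimal sketch, assuming the mdstat layout shown in the captures above; md_state is a made-up helper name and the three state words it prints are this sketch's own convention, not mdadm output.

```shell
# Print "synced", "recovering", "degraded" or "unknown" for one md array.
# Reads /proc/mdstat unless another file is given as the second argument.
md_state() {
    awk -v md="$1" '
        $1 ~ /^md[0-9]+$/ && $2 == ":" { cur = $1; next }   # start of an array stanza
        cur != md { next }                                  # skip other arrays
        /blocks/ && $NF ~ /^\[[U_]+\]$/ { flags = $NF }     # e.g. [UU] or [_U]
        /recovery|resync/ { busy = 1 }                      # rebuild in progress
        END {
            if (flags == "") { print "unknown"; exit }
            if (busy) print "recovering"
            else if (flags ~ /_/) print "degraded"
            else print "synced"
        }
    ' "${2:-/proc/mdstat}"
}

# Example wait loop (commented out so that sourcing this file does nothing):
# while [ "$(md_state md2)" != "synced" ]; do sleep 10; done
```

With the captures in these notes, md_state md2 would report "recovering" during the sync and "synced" once the array shows [UU], while md4 stays "degraded" until its SDA slice is added.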
- Make sure /mnt exists, then proceed. Hints below:
root@rescue:~# cd /mnt
root@rescue:/mnt# ls -la
total 4
drwxr-xr-x 2 root root 4096 May 12 2015 .
drwxr-xr-x 37 root root 400 May 28 09:00 ..
root@rescue:/mnt# cd
SETUP AND ENTER THE CHROOT ENVIRONMENT:
========================================
root@rescue:~# mount /dev/md2 /mnt
root@rescue:~# mount --rbind /dev /mnt/dev
root@rescue:~# mount --rbind /proc /mnt/proc
root@rescue:~# mount --rbind /sys /mnt/sys
root@rescue:~# chroot /mnt bash
root@rescue:/# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/md2 20G 3.8G 15G 21% /
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 0 16G 0% /dev/shm
tmpfs 16G 0 16G 0% /sys/fs/cgroup
root@rescue:/#
INSTALL GRUB ONTO SDA DRIVE
===========================
root@rescue:/# grub-install /dev/sda
Installing for i386-pc platform.
Installation finished. No error reported.
root@rescue:/#
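Given the suspicion that one of the disks never had grub on it, a quick look at the first sector of each disk can help confirm what the BIOS will find. This is a rough heuristic sketch, not an official grub check: has_grub_boot_sector is a made-up name, and it relies on the assumption that grub's BIOS boot sector contains the literal text "GRUB" and that grep supports the -a flag (GNU grep does).

```shell
# Return 0 if the first 512-byte sector of a disk (or image file) contains "GRUB".
# Heuristic only: it shows grub boot code is present, not that booting will succeed.
has_grub_boot_sector() {
    dd if="$1" bs=512 count=1 2>/dev/null | grep -aq 'GRUB'
}

# Example (commented out so that sourcing this file touches no disk):
# for d in /dev/sda /dev/sdb; do
#     if has_grub_boot_sector "$d"; then echo "$d: grub found"; else echo "$d: no grub"; fi
# done
```

It only reads one sector, so it is safe to run against both SDA and SDB, inside or outside the chroot.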
- OK, we are done. Exit the chroot, unmount, reboot. Happy days.
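The exact exit commands were not captured in these notes, so the following is a sketch of one plausible way to do it, not a record of what was run. The run and leave_rescue helpers are made up, it defaults to a dry run that only prints, and it assumes the rescue system's umount supports -R (recursive). If the unmount complains that the target is busy, a lazy unmount (umount -l) is a common fallback.

```shell
# Dry run by default: set DRY_RUN=0 to really execute the commands.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Call this from the rescue shell, after typing 'exit' inside the chroot.
leave_rescue() {
    run umount -R /mnt   # recursive: also covers the rbind mounts of /dev, /proc, /sys
    run reboot           # soft reboot from inside the host, as recommended above
}
```

Before the reboot, remember to switch the server back from net rescue mode to boot from hard disk in the OVH panel.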
- Final step, once the server has booted normally from disk: add the SDA slice to the last raid array so the mirror for the PVE data slice can sync up. Hints:
add in the last raid mirror piece:
mdadm --manage /dev/md4 -a /dev/sda4
thus:
root@ns506XXX:~# df -h
Filesystem Size Used Avail Use% Mounted on
udev 16G 0 16G 0% /dev
tmpfs 3.2G 18M 3.2G 1% /run
/dev/md2 20G 3.8G 15G 21% /
tmpfs 16G 37M 16G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/mapper/pve-data 1.8T 45G 1.7T 3% /var/lib/vz
/dev/sdc1 1.8T 396G 1.5T 22% /backups
tmpfs 3.2G 0 3.2G 0% /run/user/0
/dev/fuse 30M 16K 30M 1% /etc/pve
root@ns506XXX:~# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md4 : active raid1 sdb4[1]
1931489216 blocks [2/1] [_U]
bitmap: 3/15 pages [12KB], 65536KB chunk
md2 : active raid1 sdb2[1] sda2[0]
20970432 blocks [2/2] [UU]
unused devices: <none>
root@ns506XXX:~# mdadm --manage /dev/md4 -a /dev/sda4
mdadm: hot added /dev/sda4
root@ns506XXX:~# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md4 : active raid1 sda4[2] sdb4[1]
1931489216 blocks [2/1] [_U]
[>....................] recovery = 0.0% (448768/1931489216) finish=143.4min speed=224384K/sec
bitmap: 3/15 pages [12KB], 65536KB chunk
md2 : active raid1 sdb2[1] sda2[0]
20970432 blocks [2/2] [UU]
unused devices: <none>
root@ns506XXX:~#
Status a few minutes later: the raid rebuild is grinding along. It will take a few hours.
root@ns506XXX:~#
root@ns506XXX:~# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md4 : active raid1 sda4[2] sdb4[1]
1931489216 blocks [2/1] [_U]
[>....................] recovery = 0.8% (16803328/1931489216) finish=302.6min speed=105427K/sec
bitmap: 3/15 pages [12KB], 65536KB chunk
md2 : active raid1 sdb2[1] sda2[0]
20970432 blocks [2/2] [UU]
unused devices: <none>
root@ns506XXX:~#
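To keep an eye on the long md4 rebuild without reading the whole mdstat dump each time, the progress line can be picked out with a small helper. This is a sketch under the assumption that the recovery line looks like the captures above; md_progress is a made-up name.

```shell
# Print "<percent> <estimated time left>" for an array that is rebuilding,
# or nothing if the array has no recovery/resync line.
# Reads /proc/mdstat unless another file is given as the second argument.
md_progress() {
    awk -v md="$1" '
        $1 ~ /^md[0-9]+$/ && $2 == ":" { cur = $1; next }   # start of an array stanza
        cur == md && /recovery|resync/ {
            pct = ""; eta = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/) { pct = $i; sub(/^=/, "", pct) }          # e.g. 0.8%
                if ($i ~ /^finish=/) { eta = $i; sub(/^finish=/, "", eta) }
            }
            if (pct != "") print pct, eta
        }
    ' "${2:-/proc/mdstat}"
}

# Example (commented out): print progress once a minute until the rebuild ends.
# while p="$(md_progress md4)" && [ -n "$p" ]; do echo "md4: $p"; sleep 60; done
```

Note the finish estimate jumps around as the sync speed changes, as the two captures above show (143 minutes, then 302 minutes), so treat it as a rough guide.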