Restoring Corrupted QNAP Volume

Some time ago, I set up a QNAP NAS at home for our family to store photos and documents. To protect against data loss, I configured a RAID 1 group on the NAS and added a cloud backup as well, following the classic 3-2-1 rule.

Last month, I discovered that one of the HDDs had reached a dangerously high bad block count, so I bought a new HDD to replace it. But for some reason, after I swapped the new HDD into the slot, the NAS refused to rebuild the RAID group automatically, unlike in my previous experience.

I assumed a detection issue was preventing the rebuild from starting automatically, so I decided to swap the old HDD back in, expecting the two disks to get back in sync, after which I would click “Replace HDD” to kick-start the rebuild manually. However, the NAS found that the reinserted HDD contained bad blocks and refused to rebuild the array.

Even worse, during the shuffling, the NAS seems to have corrupted the partition table/metadata, which made direct attachment impossible and recovery much more difficult.

Situation

  • The HDD was marked as a spare of the RAID 1 group after the reinsert, and mdadm would not reassemble the array from the spare alone
    • mdadm --examine --scan includes a “spare: 1” line
  • After the md array was recovered, the LVM2 PV appeared to be corrupted, and LVM reported that no PVs could be found
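
The spare status described above can also be read directly from the member's md superblock. The following is a minimal sketch, assuming v1.x metadata, where the output of mdadm --examine contains a "Device Role :" line. The helper name md_device_role is my own and is not part of mdadm.

```shell
#!/bin/sh
# Hypothetical helper (not part of mdadm): print the "Device Role" field
# from `mdadm --examine` output supplied on stdin.
# Assumes v1.x superblock output, which contains a line such as
#   "   Device Role : spare"   or   "   Device Role : Active device 0".
md_device_role() {
    sed -n 's/^[[:space:]]*Device Role[[:space:]]*:[[:space:]]*//p'
}

# Usage (as root, on the machine the disk is attached to):
#   mdadm --examine /dev/sdb3 | md_device_role
```

A result of "spare" would match the situation above; a healthy member would normally report an active device slot instead.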

Important Information

  • QNAP’s storage implementation is based on md arrays + LVM2
  • The LVM2 implementation used by QNAP is customized, hence a QNAP device is required to mount the LVM2 LV to complete the recovery

Steps

Stage 1: md array recovery

The HDD was first plugged into a Kali VM via a USB-to-SATA adapter for the first stage of recovery. (At the time, I did not know that QNAP uses a custom LVM2 implementation.)
The partition in question is sdb3 on my VM.

Sidenote: My SATA adapter runs on USB 3. After bridging the adapter into the VM, lsusb still did not show it. It turns out the VM's USB compatibility mode needs to be set to USB 3.x for the guest OS to pick up a USB 3 device properly.

# Install the necessary packages
apt-get update
apt-get install mdadm lvm2
mdadm --examine --scan >> /etc/mdadm/mdadm.conf
reboot

# Attempt to simply reassemble the md array first. This did not work in my case because of the "spare" status of the HDD
mdadm -AfR /dev/md2 /dev/sdb3

# Ideally you should use dd to back up the partition first, but I did not have any storage large enough to hold a dd image,
# and in the worst case I had the cloud backup to restore from, so I skipped this step
# dd if=/dev/sdb3 of=~/sdb3.img

# Obtain the metadata version of the md array
# (Look for the one for /dev/md1, which is the md ID in the NAS as well)
mdadm --examine --scan

# [**The dangerous step**] Recreate the RAID 1 md array with 1 device only
# The -e metadata version should use what was shown above, for me it was 1.0
# --force is needed to create a single-device RAID 1 array
mdadm --create --verbose /dev/md1 --level=1 --raid-devices=1 -e X.X --assume-clean --force /dev/sdXN  # sdb3 in our case

# By now the md array should be available again (in inactive state)
cat /proc/mdstat

# Run the md array if the mdstat is good
mdadm --run /dev/md1

# Verify that the recovery was successful by running pvscan; it should report a PV,
# but will also complain about an invalid VG type, which is okay because the VG format is QNAP proprietary
pvscan
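
The metadata version needed for -e in the steps above can be picked out of the scan output instead of being read by eye. This is a rough sketch, assuming each array is printed as an "ARRAY <device> metadata=<version> ..." line; the helper name md_metadata_versions is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `mdadm --examine --scan` output on stdin and
# print "<array device> <metadata version>" for every ARRAY line that
# carries a metadata= field. Continuation lines (e.g. spare counts) are skipped.
md_metadata_versions() {
    awk '$1 == "ARRAY" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /^metadata=/) {
                sub(/^metadata=/, "", $i)
                print $2, $i
            }
    }'
}

# Usage (as root):
#   mdadm --examine --scan | md_metadata_versions
```

Arrays printed without a metadata= field produce no line, so an empty result for the array you care about means the version has to be read from mdadm --examine on the member partition instead.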

Stage 2: Obtain LVM2 config backup

Luckily, LVM2 automatically keeps copies of the volume group (VG) config. Try logging into the NAS and look in /mnt/HDA_ROOT/.config/lvm/archive to see whether there is a backup config old enough (i.e. one that dates back to when the NAS was still working as expected with the broken HDD in). Otherwise, you may be able to recover the LVM2 config from the broken HDD. In my case, there were 4 other md arrays on sdb[1245]:

cat /proc/mdstat
mdadm --run /dev/md13
mdadm --run /dev/md9
mdadm --run /dev/md322
mdadm --run /dev/md256

# Use gparted to see which partitions are actually data partitions; for me, md322 and md256 are swap partitions
mkdir -p /mnt/md13
mkdir -p /mnt/md9
mount /dev/md13 /mnt/md13
mount /dev/md9 /mnt/md9

ls -l /mnt/md9/.config/lvm/archive  # In my case, the LVM2 config archive is here
# Copy the config
umount /mnt/md13
umount /mnt/md9
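
To decide which archive file to copy (while /mnt/md9 is still mounted, or directly on the NAS), it helps to see when each one was written and by which command. The sketch below assumes the usual LVM2 archive layout, where each .vg file carries top-level description and creation_time fields; the function name lvm_archive_summary is my own.

```shell
#!/bin/sh
# Hypothetical helper: for every *.vg file in a directory, print the file
# name followed by its "description" and "creation_time" lines, so that an
# archive from before the failure can be picked out.
lvm_archive_summary() {
    dir=$1
    for f in "$dir"/*.vg; do
        [ -f "$f" ] || continue    # no matching files: skip the literal glob
        echo "== $f"
        grep -E '^[[:space:]]*(description|creation_time)[[:space:]]*=' "$f"
    done
}

# Usage:
#   lvm_archive_summary /mnt/md9/.config/lvm/archive
```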

Make a copy of the appropriate config file in a safe place, then make another copy, change vg1 to vg2 in its content, and name this new file vg2_restore.vg.
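
The rename can be scripted rather than done by hand, and the PV UUID needed later for pvcreate can be pulled out at the same time. This is only a sketch under a few assumptions: the VG name contains no regex metacharacters, the config has a single PV, and the file follows the standard LVM2 text layout with a physical_volumes { ... } section. Both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: copy an LVM config archive while renaming the VG.
# Usage: make_restore_file SRC DEST OLD_VG NEW_VG
make_restore_file() {
    src=$1 dest=$2 old=$3 new=$4
    # Replace OLD_VG only where it stands as a whole word, i.e. not directly
    # next to a letter, digit or underscore (so "vg10" is left alone).
    subst="s/(^|[^[:alnum:]_])${old}([^[:alnum:]_]|\$)/\\1${new}\\2/g"
    # Applied twice: one pass can miss an occurrence whose leading separator
    # was already consumed by the previous match.
    sed -E -e "$subst" -e "$subst" "$src" > "$dest"
}

# Hypothetical helper: print the id of the first PV listed in the config.
# The VG-level id appears before the physical_volumes section and is skipped.
pv_uuid_from_config() {
    awk '
        $1 == "physical_volumes" { in_pv = 1 }
        in_pv && $1 == "id" && $2 == "=" {
            gsub(/"/, "", $3)
            print $3
            exit
        }
    ' "$1"
}

# Usage (vg1_backup.vg stands for whichever archive file you copied):
#   make_restore_file vg1_backup.vg vg2_restore.vg vg1 vg2
#   pv_uuid_from_config vg2_restore.vg
```

It is still worth opening the generated file to confirm that only the intended names were changed before using it for a restore.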

Stage 3: LVM2 and data recovery on QNAP device

Now plug the USB-to-SATA adapter into the NAS. In my case, the broken HDD appeared as /dev/sdc[1-5], with sdc3 being the partition that holds the data.

# Assemble the recovered md array
mdadm --assemble /dev/md2 /dev/sdc3

# See if the PV is already recognized; if so, skip the corresponding steps below, otherwise proceed step by step
pvscan
vgscan
lvscan

# Upload the vg2_restore.vg file to the NAS admin home folder
vi ~/vg2_restore.vg # -> Paste

# 1. Restore PV
# Open the config file obtained in Stage 2 and note the PV UUID
# Substitute the PV UUID into the --uuid parameter
pvcreate --test -ff --uuid "XXXXXX-xxxx-xxxx-xxxx-xxxx-xxxx-XXXXXX" --restorefile ~/vg2_restore.vg /dev/md2
# The command should return "command executed successfully" or similar
# If it does, all is good and you can run it again without --test

# 2. Restore VG
vgcfgrestore --test --force --file ~/vg2_restore.vg vg2
# The command should return "command executed successfully" or similar
# If it does, all is good and you can run it again without --test

# Activate the VG and scan the LVs
vgchange -ay vg2
lvscan

# Mount the LV
ls /dev/mapper  # There should be a file named vg2-lv1
mkdir -p /mnt/recovery
mount /dev/mapper/vg2-lv1 /mnt/recovery/

TADA! The data is now available in /mnt/recovery!

After you are done with the recovery, you can unmount the disk by running:

umount /mnt/recovery/

# The following commands were run but seem to be unnecessary
# lvchange -an /dev/vg2/*
# lvremove /dev/vg2  # This might complain about dependency problem, to be safe, answer No and run this command multiple times until all LVs are down
# lvscan  # There should be no /dev/vg2 LV anymore
vgchange -an vg2

# Done, you can now unplug the device