Software RAID and LVM story

Posting this in English so it can help a few more people out there. It will also bore my usual readers, so the full story is behind the fold, a feature I'm usually not very fond of.

A new client called me in for a rescue mission. An old build server had crashed after a move and they couldn't boot it up; they wanted me to install it from scratch and then see what could be saved from the old data. "It's a RAID1 of two disks sized 250GB, so it looks like 500GB", they told me on the phone. I corrected them (a RAID1 of two 250GB disks shows up as 250GB, not 500GB), but they were very insistent that the boot screens said RAID1 and the crashed disk was a 500GB partition. I had to see it with my own eyes…

The machine was an Intel server. If you never saw one, they are the nice heavy-duty, generic-looking, no-logo boxes you can get for pretty cheap in Israel. They come with an Intel server board and an Intel BIOS (the horror! EFI shell on a non-Itanium machine!). The hardware RAID was indeed set up to bind the two present disks as a mirror, but once you booted Linux it still showed two disks. Surprised? I'm not. For the last 8 years it's been rare to find a true RAID controller; most of them are "software-assisted" RAID, which is to say the OS has no way of knowing the controller is helping you mirror unless it accesses it through a non-standard API that usually isn't implemented in Linux kernel drivers.

LESSON 1: Always check your installation screens when it's time to set up the partitions. Are you seeing the number of disks and disk sizes you expected?

The guy who installed the original server was a software guy. He didn't bother with the small print, he let Red Hat recommend him stuff, and somehow he decided to pick a small (100MB) partition on /dev/sda for /boot; the rest of that disk plus all of /dev/sdb were marked as PVs for LVM and one big LV was defined across them. I didn't even bother to check whether it was built as striped or linear, but the end result was clear: one of the two disks died and therefore the entire filesystem was useless.

LESSON 2: Never ever define RAID0, LINEAR, an lvm2 VG with multiple PVs or any other way of spanning filesystems, unless your underlying block devices are on RAID1-5-6 or the like.

At this point I pulled the disks out, installed a pair of new 1TB SATA disks (RAID-grade of course), disabled the RAID device in the BIOS and tried to install. The disks were nowhere to be found. It took me a few minutes to figure out I had a RAID BIOS that would not show me disks that were not set up in a RAID, but WOULD show me both disks if I tried to bundle them in a RAID1. Uber-smart, eh? I went into the main BIOS instead and kicked out the RAID firmware loader altogether, and only then could I see the disks directly. I then proceeded to make a small /boot on /dev/md0 (a mirror of sda1 and sdb1) and a 300GB /dev/md1 (a mirror of sda2 and sdb2). I didn't want the long sync of the entire 1TB disk, and decided I'd resize it later. Silly me, I have no clue what I was thinking (I must have had a really smart answer to that at the time but I can't remember what it was).

After the system was set up, I decided it was finally time to add the full disk space to /dev/md1. Resizing /dev/sd[ab]2 was no biggie; I knew the partition table had to be re-read and therefore I'd need a reboot. Little did I know that mdadm would not reassemble them automatically anymore! The system came up with no /dev/md1 but was working fine! LVM found the VG, but complained that sda2 and sdb2 were both PVs belonging to the VG with the same UUID, so it was only using sdb2…

LESSON 3: MD resizing may be a cool trick at parties, but don't try it for real, especially not with LVM on top… at least not when doing all of this remotely via SSH…

So now what? Back everything up and reinstall? Nope, wait, there's a neat trick I wanted to try, like a magician pulling the tablecloth from under the dinner table. I asked LVM to kindly let go of /dev/sda2 (after making sure all the I/O of the LVs was indeed happening on the other disk). Note that the way to do this is to run vgreduce and NOT pvremove! In this case LVM had noticed that sda2 had the same PV UUID as sdb2, so it was ignoring it and vgreduce was not needed. I ran it anyway to be on the safe side, and then, happy with the results, ran pvremove and thought my road was open. I tried to create a degraded RAID1 with only sda2, however mdadm came back with "mdadm: no raid-disks specified." even though sda2 was clean and the partition was type fd…

I decided to zero the partition, so I ran dd from /dev/zero onto /dev/sda2 for a few seconds and hit Ctrl-C (later I found out I could do the same with mdadm --zero-superblock /dev/sda2), then went on to create a degraded RAID:
~ # mdadm -C /dev/md1 -n 2 -l 1 missing /dev/sda2
mdadm: array /dev/md1 started.
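
A quick way to confirm the array really is running on one leg is to look at /proc/mdstat, where a missing member shows up as an underscore in the status brackets (for example [_U]). The helper below is a minimal sketch of that check; the function name list_degraded is mine for illustration, not something that ships with mdadm.

```shell
# Sketch: read /proc/mdstat-style text on stdin and print the names of
# arrays that have at least one missing member ("_" in the [UU] field).
# list_degraded is a hypothetical helper name, not an mdadm command.
list_degraded() {
    awk '
        # An array stanza starts with a line like "md1 : active raid1 sda2[1]"
        /^md[0-9]+ :/ { name = $1 }

        # The status line of that stanza carries a field like [UU] or [_U]
        /\[[U_]+\]/ && name != "" {
            if (match($0, /\[[U_]+\]/) &&
                index(substr($0, RSTART, RLENGTH), "_"))
                print name
            name = ""
        }
    '
}
```

On a machine with software RAID it would be called as `list_degraded < /proc/mdstat`; at this point in the story it should name md1 and nothing else.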

Everyone with me so far? /dev/sdb2 is a 300GB PV on a 990GB partition with 5 LVs on it, and md1 is a RAID1 with only one disk in it. Now here comes the tricky part…

I made md1 into a PV, then added it to the VG and moved everything onto it:
~ # vgextend vgblah /dev/md1
Volume group "vgblah" successfully extended
~ # pvmove /dev/sdb2 /dev/md1

(now go have lunch… it takes a while to move 300GB)

The fun part is that the filesystems on top of the VG are alive and well :-) You can do this on a live server. Probably not a great idea, but it works…
Once it's done, it's time to remove /dev/sdb2 from the VG, remove it as a PV, zero the superblock, add it to md1 instead, and let it rebuild:
~ # vgreduce vgblah /dev/sdb2
Removed "/dev/sdb2" from volume group "vgblah"
~ # mdadm --add /dev/md1 /dev/sdb2
mdadm: added /dev/sdb2
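
Put together, the whole tablecloth trick is a short sequence. The sketch below only prints the commands in order and runs nothing, so it is safe to paste anywhere. The function name plan_pv_to_md is made up for illustration, and the pvcreate, pvremove and --zero-superblock lines are my reading of the steps described in the text, not output captured from the machine.

```shell
# Dry-run sketch: print the PV-to-mirror migration steps in order.
# Nothing is executed. plan_pv_to_md is a hypothetical helper name;
# the arguments are the VG, the plain PV being emptied, and the
# degraded md device that takes over.
plan_pv_to_md() {
    vg=$1
    old_pv=$2
    md_dev=$3

    echo "pvcreate $md_dev"                   # make the degraded mirror a PV
    echo "vgextend $vg $md_dev"               # add it to the volume group
    echo "pvmove $old_pv $md_dev"             # move all extents, online
    echo "vgreduce $vg $old_pv"               # drop the now-empty PV from the VG
    echo "pvremove $old_pv"                   # wipe its LVM label
    echo "mdadm --zero-superblock $old_pv"    # clear any stale md metadata
    echo "mdadm --add $md_dev $old_pv"        # join the mirror and resync
}
```

Calling `plan_pv_to_md vgblah /dev/sdb2 /dev/md1` prints the seven steps for the setup in this post, which makes a handy checklist to read through before touching anything over SSH.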

Is that all? I was still afraid the machine would boot up and fail to see the RAID1 yet again. The trick is of course to let the initrd assemble it, so just to make sure, compare your /etc/mdadm.conf with what's really on the system:
~ # mdadm -D -s
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=aa30240a:1899839f:70bb2eaf:9199ff7f
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=3bec2b13:aa407875:4e9cbe52:c79b59d6
~ # cat /etc/mdadm.conf
# mdadm.conf written out by anaconda
DEVICE partitions
ARRAY /dev/md0 level=raid1 num-devices=2 uuid=3bec2b13:aa407875:4e9cbe52:c79b59d6
ARRAY /dev/md1 level=raid1 num-devices=2 uuid=a9ffb64c:52df17a8:31b44db1:a2eb4dda

Depending on your initrd system, this mismatch (note that the md1 UUIDs differ) means the assembly would fail at boot (happened to me on Debian once), so I fixed the UUID in the config and rebuilt the initrd, just in case:
~ # mkinitrd -f /boot/initrd-2.6.18-92.1.22.el5.centos.plusPAE.img 2.6.18-92.1.22.el5.centos.plusPAE
(if you are using lilo instead of GRUB, this is a time to run it)
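
Eyeballing two lists of UUIDs is easy to get wrong, so the comparison can also be done mechanically. This is a minimal sketch, assuming the scan output has been saved to a file first; the function name uuid_mismatches is hypothetical, and it only looks at ARRAY lines, accepting both the UUID= and uuid= spellings seen above.

```shell
# Sketch: report arrays whose UUID in the config file differs from the
# UUID in a saved "mdadm -D -s" scan. uuid_mismatches is a hypothetical
# helper name.
#   $1 = file holding the scan output, $2 = mdadm.conf to check
uuid_mismatches() {
    awk '
        $1 != "ARRAY" { next }            # skip comments, DEVICE lines, etc.
        {
            uuid = ""
            for (i = 3; i <= NF; i++)     # find the UUID=/uuid= field
                if (tolower($i) ~ /^uuid=/)
                    uuid = tolower(substr($i, 6))
        }
        pass == 1 { live[$2] = uuid; next }   # first file: running arrays
        ($2 in live) && live[$2] != uuid {    # second file: the config
            print $2 ": config has " uuid ", running array has " live[$2]
        }
    ' pass=1 "$1" pass=2 "$2"
}
```

A typical run would be `mdadm -D -s > scan.txt` followed by `uuid_mismatches scan.txt /etc/mdadm.conf`. With the two listings shown above it reports /dev/md1 and stays silent about /dev/md0; no output at all means the config already matches.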

Now cross your fingers and reboot if you must :)

4 comments on “Software RAID and LVM story”

  1. Nice post on an interesting topic.

    The story with the RAID controller sounds crazy. Why on earth would you get a RAID controller if not so that it hides the whole business from us and takes care of it by itself?

    1. 1. So that it does fast XOR if you are doing RAID4-5-6.

      2. You need to know to ask for the cache. The expensive controllers come with a ton of RAM and backup batteries on the card, and a price tag of a whole server for the controller alone. The cheap ones come with only a basic cache, if any.

      In general, most RAID controllers are a fig leaf for those whose operating system doesn't do RAID properly in software.

      Another example: a pair of disks in software mirror mode, Microsoft style, means the data is written to both A and B but read only from A. In Linux it is read from both A and B, depending on what is smarter at that moment (a crazy optimization of the elevator algorithm), which makes reading from a pair of disks 100% faster than from a single disk, and a triple mirror 200% faster (a mode not even supported by all controllers).

      1. I've learned something (again).

        Everywhere RAID is explained at a basic level, for example in the book "The PC Technician's Guide" or something like that, they explain that RAID 1 slows down writes (double write) but speeds up reads (alternating reads from the disks). Turns out that's not true: either the read comes from only one disk (MS) and there is no speedup, or the read really is accelerated but not exactly by alternating reads. Apparently at Microsoft they forgot to read the book.

        The caption at the bottom about you remembering me is amusing.

        1. Times have changed… When you write to disk on modern hardware, the kernel driver does the smart thing: it writes once to DMA and then instructs the controller to send the same data two or three times, to all the disks that need it. What slows things down is computing XOR of the data in a non-mirror mode (that is, RAID4-5-6), where you actually read data back from the disk to work out what to write. Disk prices today don't justify working in such a mode any longer.
