LVM: The Basics

Well, I’m back from the holiday binge, and I’ve brought some shiny new 3D graphics with me! On that tangent, Blender is a pretty cool application, and I wish I had the time to get to know it better. I’m not a 3D guru, didn’t take any classes in it, and have never used any high-dollar computer modeling packages, so don’t take that as a professional endorsement, but for the curious it should prove to be a worthy diversion (be sure to do the tutorials from their site). Update: Here is a link to the blender files used to create the images in this article.

On to the article.  The purpose here is to give someone who has never had any exposure to LVM, or Logical Volume Manager, a basic understanding of the concepts and how to use it. Later on I hope to dig deeper into the details, such as alignment with RAID volumes, snapshots, and other features.  I do assume some understanding of plain old partitioning, but will brush over that topic as well in order to highlight differences in the process.

Let’s start with the ‘why?’.  Most Linux users will be familiar with the idea of carving up their hard disk into one or more usable containers (partitions) and applying file systems to them, and for casual users, that’s generally sufficient.  The process might go something like this:

  • Start with a raw hard disk. The device name might come up as “/dev/sda”.
  • using ‘fdisk’, or if you’re fancy some graphical partitioning tool, you might assign the first 200 megabytes or so as the first partition on “/dev/sda”, creating “/dev/sda1”. You decide that this will hold your system boot data “/boot”.
  • You then assign the next  one or two gigabytes, creating “/dev/sda2”, to be used as virtual memory, or swap.
  • A casual user might then take the rest of the drive space and create “/dev/sda3” for the operating system data, user data, and everything else. They may instead choose to make more partitions, one just for the operating system, one just for user data, etc, but for the sake of simplicity we’ll stick with three partitions at the moment.
Representation of a physical hard disk being partitioned into usable containers

  • You then create the filesystems on each partition. You can think of this as giving each partition structure, or priming it for use. For example, you’d run ‘mkswap /dev/sda2’ to turn the partition we created into swap space, or ‘mkfs.ext2 /dev/sda1’ to let you store your boot files on sda1 using the ext2 filesystem (the commands are shown after the figure below).
Creating swap space and a boot file system out of partitions.
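
At the prompt, those formatting steps might look something like this (output omitted):

root@linux:~# mkswap /dev/sda2
root@linux:~# mkfs.ext2 /dev/sda1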

So here we have a fairly basic, standard partitioning setup, but what does a user do if they find that they’ve filled “/dev/sda3” with their movies? Not only are they out of space, but their computer is unstable because the system doesn’t have any place to store temporary operating data. Maybe the user wants to make “/dev/sda3” larger, but there’s no more room on the disk. Their only option is to add another hard drive, “/dev/sdb”, and create a new partition, “/dev/sdb1”, to be used exclusively for “/media/movies” (or what have you), but now “/dev/sda3” is mostly an empty, oversized partition.

What would really be handy in this case is to be able to create a partition out of pieces of two different physical drives. We’d want to shrink “/dev/sda3”, use it exclusively for the system data, and then take some free space from “/dev/sda” and “/dev/sdb” and create a partition for user data.

While classic partitions can be resized with certain tools, they’re limited by the physical boundaries of the drive size, as well as the location and size of the other partitions on the drive. For example, if we wanted to grow “/dev/sda3”, we’d need some free space, and it would have to be immediately adjacent to “/dev/sda3”.  This is the basic problem that LVM is designed to solve.

The primary advantage of LVM is that it abstracts physical disk boundaries away from partitions. Instead of physical disks, you now have a pool of storage that can be made up of one, two, three and three quarters disks, or whatever you may have. That pool of storage can be metered out to partitions or taken back from them in small chunks.  Don’t forget that there are other advanced functions that also make LVM useful, but are beyond the scope of this article.

So let’s get started with an overview of LVM in action. We’ve made partitions for “/boot” and swap, but now instead of making “/dev/sda3” into “/” or the root volume, we’re going to take it and a partition from a second installed drive, “/dev/sdb1”, and create what’s called a volume group. This volume group is going to serve as the pool of storage discussed earlier.

The first step is to take our partitions and mark them as physical volumes that LVM can use in a pool. This is a simple process that involves running the ‘pvcreate’ command on each partition that we want to make available. This command, simply put, creates a metadata header on each partition that stores LVM information and allows it to work its magic. As an aside, it’s not strictly necessary to create partitions; you can ‘pvcreate’ an entire disk if you’d like (e.g. /dev/sdb instead of /dev/sdb1).

root@linux:~# pvcreate /dev/sda3 /dev/sdb1
Physical volume "/dev/sda3" successfully created
Physical volume "/dev/sdb1" successfully created

Marking partitions as physical volumes to be used with LVM

Next, we bundle those together into a single volume group. This is done with the ‘vgcreate’ command, which writes information about the volume group into the metadata header of each physical volume in the group. It will also create the volume group device, such as “/dev/vg0”. Note that we can add more physical volumes to this volume group at any time with the ‘vgextend’ command (an example follows below).

root@linux:~# vgcreate vg0 /dev/sda3 /dev/sdb1
Volume group "vg0" successfully created

Creating a volume group from two physical volumes
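
As mentioned above, the pool can be grown later with ‘vgextend’. For instance, if we added a hypothetical third drive and marked a partition on it, say /dev/sdc1, as a physical volume, folding it into the group would look something like this:

root@linux:~# pvcreate /dev/sdc1
root@linux:~# vgextend vg0 /dev/sdc1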

Now we’ve got our big pool of storage. Notice the grid marks on it. This was my attempt to portray that the volume group is divided into small pieces, or physical extents, usually 4 megabytes each by default. These physical extents are the building blocks for logical volumes, which will serve as replacements for our classic partitions. Creating logical volumes is basically the process of assigning these extents to a defined container. The extents can come from any physical volume in the volume group; it doesn’t really matter which (though it can optionally be controlled, for example with the contiguous allocation flag). We’re basically just taking pieces from the pool and assigning them to a new logical volume. This process also creates device nodes for us, “/dev/vg0/lv0” and “/dev/vg0/lv1”. Note how the node follows the pattern “/dev/<volume group name>/<logical volume name>”.

root@linux:~# lvcreate --extents 5120 --name lv0 /dev/vg0
Logical volume "lv0" created
root@linux:~# lvcreate --extents 20480 --name lv1 /dev/vg0
Logical volume "lv1" created

Creating logical volumes from physical extents in pool vg0

Now we’ve got logical volumes, the LVM equivalent of partitions. Note that I created a 20 gigabyte volume called “lv0” by assigning 5,120 x four megabyte extents, and an 80 gigabyte “lv1” with 20,480 extents. I did this for the sake of the example; in practice you could also use “--size 20G” instead of “--extents 5120”. Note also that I did not use all of the volume group; there are spare extents on the right waiting to be added to lv0 or lv1, or used for a new logical volume.
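
For reference, the size-based form of the first command would have looked something like this (20G being equivalent to 5,120 of the default four megabyte extents):

root@linux:~# lvcreate --size 20G --name lv0 /dev/vg0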

These new logical volumes can now be treated as normal partitions and formatted with the filesystem of your choice. In this example, we’re going to use the 20 gigabyte volume for the root filesystem “/”, and the larger, 80 gigabyte volume for user data on “/home”.

root@linux:~# mkfs.ext3 /dev/vg0/lv0

root@linux:~# mkfs.ext3 /dev/vg0/lv1

Using mkfs.ext3 to format logical volumes

And that’s it for the basics. We’ve covered how to use LVM to create volumes that are a replacement for classic partitions, breaking through physical disk barriers. One of the things I like about LVM is the simplicity. The commands are consistent: pvcreate, vgcreate, lvcreate, and so on. All you have to do is remember the concepts (physical volumes into volume groups into logical volumes) and you can figure out the commands and what order to run them in.

I’ll leave you with a few examples of how to view the status of your new LVM volumes, as well as how to expand a logical volume while it’s online.

Extend logical volume:

root@linux:~# lvextend --size +5G /dev/vg0/lv0
Extending logical volume lv0 to 25.00 GB
Logical volume lv0 successfully resized

root@linux:~# lvdisplay /dev/vg0/lv0
--- Logical volume ---
LV Name                /dev/vg0/lv0
VG Name                vg0
LV UUID                By4T9J-wPhq-fDYt-JuNE-t3Bc-UoB8-CC6TaV
LV Write Access        read/write
LV Status              available
# open                 1
LV Size                25.00 GB
Current LE             6400
Segments               2
Allocation             inherit
Read ahead sectors     auto
- currently set to     256
Block device           254:0

Resize file system:

root@linux:~# resize2fs /dev/vg0/lv0
resize2fs 1.41.3 (12-Oct-2008)
Filesystem at /dev/vg0/lv0 is mounted on /; on-line resizing required
old desc_blocks = 2, new_desc_blocks = 2
Performing an on-line resize of /dev/vg0/lv0 to 6553600 (4k) blocks.
The filesystem on /dev/vg0/lv0 is now 6553600 blocks long.

root@linux:~# df -h /
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg0-lv0    25G  173M   24G   1% /
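
As an aside, lvextend can handle both steps at once with its --resizefs option, which resizes the filesystem along with the volume. A hypothetical one-liner for the same operation:

root@linux:~# lvextend --resizefs --size +5G /dev/vg0/lv0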

View physical volumes:

root@linux:~# pvdisplay
--- Physical volume ---
PV Name               /dev/sda3
VG Name               vg0
PV Size               64.76 GB / not usable 3.19 MB
Allocatable           yes (but full)
PE Size (KByte)       4096
Total PE              16578
Free PE               0
Allocated PE          16578
PV UUID               YF6kv3-xA34-4UB2-uWYc-W061-e06E-XKiGzj

--- Physical volume ---
PV Name               /dev/sdb1
VG Name               vg0
PV Size               74.50 GB / not usable 1.03 MB
Allocatable           yes
PE Size (KByte)       4096
Total PE              19073
Free PE               8771
Allocated PE          10302
PV UUID               G8FjdC-PkW4-L0yq-TMsz-cwh7-XoXN-swqZOY

View volume groups:

root@linux:~# vgdisplay
--- Volume group ---
VG Name               vg0
System ID
Format                lvm2
Metadata Areas        2
Metadata Sequence No  6
VG Access             read/write
VG Status             resizable
MAX LV                0
Cur LV                2
Open LV               1
Max PV                0
Cur PV                2
Act PV                2
VG Size               139.26 GB
PE Size               4.00 MB
Total PE              35651
Alloc PE / Size       26880 / 105.00 GB
Free  PE / Size       8771 / 34.26 GB
VG UUID               DY0Hsk-vT57-pMnV-Rrgm-u2Hb-U8kh-Xqvgcp

View logical volumes:

root@linux:~# lvdisplay
--- Logical volume ---
LV Name                /dev/vg0/lv0
VG Name                vg0
LV UUID                By4T9J-wPhq-fDYt-JuNE-t3Bc-UoB8-CC6TaV
LV Write Access        read/write
LV Status              available
# open                 1
LV Size                25.00 GB
Current LE             6400
Segments               2
Allocation             inherit
Read ahead sectors     auto
- currently set to     256
Block device           254:0

--- Logical volume ---
LV Name                /dev/vg0/lv1
VG Name                vg0
LV UUID                ZjPt4x-AHK7-Q0tj-JXSZ-p9Kw-kg2C-GIdbJN
LV Write Access        read/write
LV Status              available
# open                 0
LV Size                80.00 GB
Current LE             20480
Segments               2
Allocation             inherit
Read ahead sectors     auto
- currently set to     256
Block device           254:1

Parity in a RAID system

In the future I’d like to discuss some tests I’ve done with Linux md-raid and hardware RAID, as well as write up a few instructional documents on md-raid, but I thought it might be interesting to talk a bit about parity before digging into the details of RAID 5 performance, caching, and stripe size. This is a pretty basic explanation, but hopefully it will give beginner/intermediate folks like me some insight into exactly how we’re able to provide fault tolerance and redundancy simply by adding one extra drive to the array.

You may already be familiar with parity, or at least have heard of it, perhaps in the context of a parity bit, which works in a similar manner but is used for error detection. More on that some other time, perhaps. Most system admins have at least a general idea of what the parity data in a RAID array is, i.e. ‘extra data that can be used to rebuild an array’, but I find it interesting to go through the exercise of seeing exactly how it works.

If you’re only somewhat familiar with how RAID 5 works, you may have at least heard something about XOR calculations or XOR hardware on your controller. XOR is a logic operation; it stands for “exclusive or”, meaning ‘I’ll take this or that, but not both’. Without getting too much into what that means, it’s basically just binary addition with the carry thrown away. In other words, if we XOR bits ‘0’ and ‘0’, we get ‘0’ (0 + 0 = 0). If we XOR bits ‘1’ and ‘0’, we get 1. It doesn’t get tricky until we XOR bits ‘1’ and ‘1’, which rolls the digit over to binary two, or ’10’. We only keep the least significant bit, however, so in this case 1 + 1 = 0. Using the following four equations, we should be able to XOR anything we need to:

  • 0 + 0 = 0 (nothing)
  • 1 + 0 = 1 (I’ll take this)
  • 0 + 1 = 1 (I’ll take that)
  • 1 + 1 = 0 (but not both)
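
These are easy to verify for yourself; bash’s arithmetic expansion has an XOR operator, ^, that performs the same calculation:

root@linux:~# echo $(( 0 ^ 0 ))
0
root@linux:~# echo $(( 1 ^ 0 ))
1
root@linux:~# echo $(( 1 ^ 1 ))
0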

Now, let’s apply this to our parity striped RAID set. In the following example, we’ll look at a single stripe set across three disks: two hold data and the third holds parity for that data. Say Stripe 1 holds a ‘1’ and Stripe 2 holds a ‘0’; the parity stripe then holds 1 + 0 = 1.

If we lose Stripe 1, we can determine what it held by reversing the equation, or solving ? + 0 = 1. Likewise, if we lose Stripe 2 we can do the same (1 + ? = 1). If we lose the parity stripe we haven’t really lost any data and can simply recalculate parity. Now, the interesting thing about the math is that rather than looking at it as, say, 1 + ? = 1 when we lose Stripe 2, we can restore Stripe 2 by XORing Stripe 1 with the parity stripe: 1 (Stripe 1) + 1 (Parity) = 0 (Stripe 2).

Now, in reality, the stripes are much larger than a single bit as described above; you’ll likely have a stripe of 4 kilobytes up to 256 kilobytes or maybe more, but the mechanics are the same. Let’s upgrade the previous example to a three-bit stripe just to show how it works with multiple bits. Stripe 1 will contain ‘101’ and Stripe 2 will contain ‘011’. To calculate the parity, we match up the bit positions of Stripes 1 and 2 and then XOR each pair of bits individually. It helps to stack them vertically and work on each column one by one, as shown below.
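
Working column by column (1+0, 0+1, 1+1), the parity stripe comes out to ‘110’:

Stripe 1:  1 0 1
Stripe 2:  0 1 1
----------------
Parity:    1 1 0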

And again, here’s an example of ‘losing’ Stripe 2 and recalculating it from parity.
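
XORing Stripe 1 against the parity stripe, column by column, gives back the missing ‘011’:

Stripe 1:  1 0 1
Parity:    1 1 0
----------------
Stripe 2:  0 1 1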

Now, that’s great, but what about four-, five-, six-disk arrays? Well, once you get the idea of how XOR works, it’s pretty simple. If you understand binary math you can continue to use it to add four, five, or six bits together, but another rule of thumb is to think of ‘0’ as even and ‘1’ as odd. For example, 1+1+1, an odd plus an odd plus an odd, will be odd, so parity equals 1. Another example, 0+1+0+1: an even plus an odd is odd, plus an even is still odd, plus an odd is even, so parity equals 0. If that gets too confusing, you can simply count how many ‘1’s you have. If you’ve got an odd number of ‘1’s, then parity is 1; if you have an even number of ‘1’s, parity is 0.
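
The same bash trick from earlier works for checking these wider examples, too:

root@linux:~# echo $(( 1 ^ 1 ^ 1 ))
1
root@linux:~# echo $(( 0 ^ 1 ^ 0 ^ 1 ))
0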

These parity calculations are fairly simple, yet interesting for the fact that we can provide protection for the data on any number of disks by adding just a single disk. There’s no need to keep a full copy of the data. The caveat, of course, is that you can only lose one disk per parity stripe. (You can have dual parity as well, which tolerates two lost disks and is known as RAID 6.)

Another drawback is that while you’re missing one of your disks, any stripe sets that had data on that disk will take a performance hit, because the missing data has to be reconstructed with XOR calculations rather than simply read from the now-lost disk. Stripes that only had parity on the lost disk won’t take a read performance hit, but it will take some processing time to rebuild the parity once the disk is replaced. These performance hits are going to be an important part of later discussions about RAID 5 performance.

Probably worst of all, in order to modify data in a stripe, the system has to read existing data back from the array (either the rest of the stripe set, or the old data and old parity blocks) so it can recalculate parity for the changed data, effectively causing reads along with every write. This is where on-card caches can be helpful.
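
As a single-bit illustration of that recalculation (my own shorthand, not anything the array exposes directly): if a data bit changes from 1 to 0 and the old parity for its stripe was 1, the new parity is just the XOR of the old parity, the old data, and the new data:

root@linux:~# echo $(( 1 ^ 1 ^ 0 ))
0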

To finish off the exercise, I’ve put together this simple RAID 5 array. Each disk has four stripe sets, and each stripe set (in matching color) has three data stripes and one parity stripe. You can see that the parity stripe alternates onto a different disk for each stripe set, which is the primary difference between RAID 5 and RAID 4 (RAID 4 places all of its parity stripes on the same disk). If you’d like to test yourself and see just how your data is protected, first calculate the parity stripe for each set, then cover up a whole column (disk) at random and see if you can reconstruct its contents by doing the math.

As always, if anyone is reading this ;-), feel free to leave your feedback, correct me, whatever.