Reliably boot Fedora with root on ZFS

There is a newer version of this page for users of ZFS-0.7.x.y and later.

Tested on 2017-09-03 with

Fedora 25


Introduction

This article shows how to boot a recent version of Fedora (circa 24) with root on ZFS. What distinguishes it from the crowd is that the result keeps booting after automatic kernel updates.

I was motivated to collect and publish these methods for several reasons: Most of the root-on-zfs articles are either out of date, don't apply to Fedora, or have a key defect: They may work (or might have worked), but they don't survive automatic kernel updates. In other words, if you run dnf update and it includes a new kernel, the system won't boot. That is the problem this article shows how to fix.

UPDATE: My fans and detractors remind me to say that ZFS is now very robust on almost all Linux distributions. And I should add that I've never lost a byte of data. Any hint of peevishness comes from pushing the envelope: Both ZFS and Fedora are rapidly evolving: With two moving targets, some extra effort should be expected.

Why ZFS?

ZFS has significant advantages because it replaces things like software RAID and LVM with a more flexible and general framework: pooled storage, end-to-end checksumming, cheap snapshots and clones, built-in compression, and send/receive replication, to name just a few highlights.

Why root on ZFS?

If you only knew the power of ZFS (wheeze gasp), other solutions would seem ugly and ineffective.

Quick summary of the process

  1. Create a separate Fedora installation on a second disk; call it the "installer".
  2. Add support for ZFS.
  3. Create two partitions on the target disk: "efi" and "zfs".
  4. Create a ZFS pool and root dataset on "zfs".
  5. Copy the operating system from the installer to the root dataset on "zfs".
  6. Configure "efi" so Linux will use the root on "zfs".

To simplify the discussion, only one disk is used for the ZFS pool. The process of building more complex pools is covered on dozens of ZFS websites. I've written one myself. It is, however, both unusual and unwise to deploy ZFS without some form of redundancy. With that in mind, this article should be treated as a tutorial rather than a practical solution. Proceed at your own risk.

Obtain Fedora

I used:

Fedora-Workstation-Live-x86_64-24-1.2.iso

Configure the BIOS

Configure your BIOS to operate the target disk controller in AHCI mode (the default on most modern motherboards) and to boot from the device where you've mounted the installation media. You should see two choices for the installation volume: one will mention UEFI. That's the one you must use. Otherwise, look for settings that force booting in UEFI mode. If the installer doesn't believe you booted using EFI firmware, it will gum up the rest of the process, so take time to get this right.
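
Once you've booted the installation media (or at any later point), a quick way to confirm you're really in EFI mode is to check for the EFI variables directory the kernel exposes:

ls /sys/firmware/efi

If the directory exists, you booted via UEFI; if it's missing, you booted in legacy BIOS mode and should revisit your firmware settings.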

Create an installer

We will start by creating an installer: a complete Fedora installation that has the same GPT, UEFI, and partition structure as our target zfs system.

Because we're going to copy the root of the installer to the target, it makes sense to configure the installer to be as similar to the target as possible. After we do the copy, the target is almost ready to boot. A nice thing about UEFI on GPT disks is that no "funny business" is written to boot blocks or other hidden locations. Everything is in regular files.

We're going to make the target system with GPT partitions. The installer needs to have the same structure, but Anaconda won't create a GPT label on a small disk unless we force it. To do that, we must add a boot option.

Boot the installation media. On the menu screen, select

Start Fedora-workstation-live xx

Press "e" key to edit the boot line. It will look something like this:

vmlinuz ... quiet

At the end of the line add the string "inst.gpt" so it looks like this:

vmlinuz ... quiet inst.gpt

IMPORTANT: If you see an instruction to use the tab key instead of the "e" key to edit boot options, you have somehow failed to boot in EFI mode. Go back to your BIOS and try again.

Proceed with the installation until you get to the partitioning screen. Here, you must take control and specify standard (not LVM) partitioning.

Create two partitions:

1) A 200 MiB partition mounted at /boot/efi
2) A partition that fills the rest of the disk, mounted at root "/".

Anaconda will recognize that you want a UEFI system. Press Done to proceed. You'll have to do it twice because you'll be harassed about not creating a swap partition.

The rest of the installation should proceed normally. Reboot the new system and open a terminal session. Elevate yourself to superuser:

su 

Then update all the packages:

dnf update

Disable SELinux

Everything is supposed to work with SELinux. But I'm sorry to report that everything doesn't: Fixes are still being done frequently as of 2016. Unless you understand SELinux thoroughly and know how to fix problems, it's best to turn it off. In another year, this advice could change.

Edit:

/etc/sysconfig/selinux

Inside, disable SELinux:

SELINUX=disabled

Save and exit. Then reboot:

shutdown -r now
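
After the reboot, it doesn't hurt to confirm that SELinux is really off:

getenforce

It should print "Disabled".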

Check for correct UEFI installation

dnf list grub2-efi shim

These packages should already be installed. If not, you somehow failed to install a UEFI system.

Add extra grub2 modules

dnf install grub2-efi-modules

This pulls in the zfs module needed by grub2 at boot time.

Create some helper scripts

The following scripts will save a lot of typing and help avoid mistakes. What they do and how they work will be explained as we go along.

The zmogrify script

Create a text file "zmogrify" in /usr/local/sbin (or somewhere on the PATH) that contains:

#!/bin/bash
# zmogrify - Run dracut and configure grub to boot with root on zfs.

kver=$1
sh -x /usr/bin/dracut -fv --kver $kver
mount /boot/efi
grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
grubby --set-default-index 0
mkdir -p /boot/efi/EFI/fedora/x86_64-efi
cp -a /usr/lib/grub/x86_64-efi/zfs.mod /boot/efi/EFI/fedora/x86_64-efi
umount /boot/efi

Save the file and make it executable:

chmod a+x zmogrify

What is zmogrify doing?

The first step runs dracut to produce a new /boot/initramfs file that includes zfs support. Dracut is able to do this because of the zfs-dracut package (installed later in this article), which provides a dracut plug-in.

The peculiar way of running dracut with "sh -x" works around a totally mysterious problem with Fedora 25 that otherwise makes dracut hang after displaying the line:

*** Including module: kernel-modules *** 

If you understand why, I'd very much like to hear from you.

The grub2-mkconfig step creates a new grub.cfg script in the EFI boot partition.

The grubby line makes sure your new kernel is the one that boots by default.

Finally, we create a directory in the EFI partition and copy the boot-time version of the zfs module needed by grub2 to mount your zfs root file system.

The zenter script

Create a text file "zenter" in /usr/local/sbin (or somewhere on the PATH) that contains:

#!/bin/bash
# zenter - Mount system directories and enter a chroot

target=$1

mount -t proc /proc $target/proc
mount --rbind /sys $target/sys
mount --rbind /dev $target/dev

chroot $target

Save the file and make it executable:

chmod a+x zenter

Make grub2 include zfs support when building configuration files

Edit:

/etc/default/grub

Add the line:

GRUB_PRELOAD_MODULES="zfs"

This will come into play later when we create a grub configuration file on the efi partition.
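
For orientation, a typical Fedora /etc/default/grub ends up looking something like this; your existing lines will differ, and only the last one is our addition:

GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="rhgb quiet"
GRUB_DISABLE_RECOVERY="true"
GRUB_PRELOAD_MODULES="zfs"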

Add udev rules for grub2-mkconfig

This overcomes a limitation that makes the grub2-mkconfig script fail if your pool devices don't have simple device names. Details will be explained later.

Create a file:

/etc/udev/rules.d/70-grub2-fix.rules

Add these two rules. Each must be a single long line in the file; they're wrapped here with \ only for display:

ENV{DEVTYPE}=="partition", IMPORT{parent}="ID_*", \
ENV{ID_FS_TYPE}=="zfs_member", \
SYMLINK+="$env{ID_FS_UUID}"

ENV{DEVTYPE}=="partition", IMPORT{parent}="ID_*", \
ENV{ID_FS_TYPE}=="zfs_member", \
SYMLINK+="$env{ID_BUS}-$env{ID_SERIAL} $env{ID_BUS}-$env{ID_SERIAL}-part%n"

One last grub fix

This is our last and most complicated script. Actually, we're going to modify an existing script. This overcomes a current bug (circa 2016-10) that prevents root on zfs from booting.

Normally, the kernel boot line for a zfs root looks like this:

linuxefi ... root=ZFS=pool/ROOT/fedora ro ...

Because of a recent labor dispute between grub2 and dracut over boot time, this line will fail. The fix is to remove the "root=..." argument entirely. In that case, dracut will look for the first pool it finds with a bootfs property and use the value as root. This is obviously not a completely satisfactory solution, but it has the unarguable merit of working. I expect this fix to be unnecessary Real Soon Now.

Rather than editing the grub.cfg file directly, as some savants suggest, we're going to modify the script that creates grub.cfg. The advantage of this method is that it keeps working automatically: if a dnf update triggers some other installer script to run grub2-mkconfig, the regenerated file will still be correct.

First, preserve a copy of the original script:

cp /etc/grub.d/10_linux /root/old_grub_10_linux

Before we start, it might help to review what we're doing: The goal is to make the script leave out the entire "root=" clause when grub is working with a zfs filesystem and behave normally otherwise.

Edit:

/etc/grub.d/10_linux

On or about line 58 you'll see a case statement that switches on the variable GRUB_FS. Change the case statement by introducing the variable ROOTEQ and setting LINUX_ROOT_DEVICE to an empty string. We add an "otherwise" case to set the default. When you're done, it should look like this:

...
case x"$GRUB_FS" in
    xbtrfs)
        ROOTEQ="root="
        rootsubvol="`make_system_path_relative_to_its_root /`"
        rootsubvol="${rootsubvol#/}"
        if [ "x${rootsubvol}" != x ]; then
        GRUB_CMDLINE_LINUX="rootflags=subvol=${rootsubvol} ${GRUB_CMDLINE_LINUX}"
        fi;;
    xzfs)
        ROOTEQ=""
        rpool=`${grub_probe} --device ${GRUB_DEVICE} --target=fs_label 2>/dev/null || true`
        bootfs="`make_system_path_relative_to_its_root / | sed -e "s,@$,,"`"
        LINUX_ROOT_DEVICE=""
        ;;
    *)
        ROOTEQ="root="
        ;;
esac
...

Now search down for the string "root=" and replace it with ${ROOTEQ} wherever it occurs. There are two places to fix. They should end up looking like this:

...
${linuxefi} ${rel_dirname}/${basename} ${ROOTEQ}${linux_root_device_thisversion} ro ${args}
...
linux${sixteenbit} ${rel_dirname}/${basename} ${ROOTEQ}${linux_root_device_thisversion} ro ${args}
...

Save and exit. Put a copy in a safe place:

cp /etc/grub.d/10_linux /root/new_grub_10_linux

Prepare for future kernel updates

ln -s /usr/local/sbin/zmogrify /etc/kernel/postinst.d

The scripts in the postinst.d directory run after every kernel update. The zmogrify script takes care of updating the initramfs and the grub.cfg file on the EFI partition. (Stuff you need to boot with root in a zfs pool.)
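
These scripts are passed the new kernel version as their first argument, which is what zmogrify picks up as $1. To confirm the hook is in place:

ls -l /etc/kernel/postinst.d

You should see zmogrify listed as a symbolic link pointing at /usr/local/sbin/zmogrify.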

Configure the ZFS repository

Using your browser, visit ZFS on Fedora.

Click on the link for your version of Fedora. When prompted, open it with "Software Install" and press the install button. This adds the zfs-release repository package to your package database; it tells the package manager where to find zfs packages and updates.
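
If you prefer the command line, the ZFS on Linux project documented a dnf equivalent at the time; treat this as a sketch and check the site for the current URL:

dnf install http://download.zfsonlinux.org/fedora/zfs-release$(rpm -E %dist).noarch.rpm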

Install zfs

dnf install kernel-devel
dnf install zfs zfs-dracut

During the installation, the process will pause to build the spl module and then the zfs module. A line will appear showing the path to their locations in /lib/modules/x.y.z/extra. If you don't see those lines, you don't have a match between the kernel-devel package you installed and your currently running kernel.
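
If you're not sure the build succeeded, these commands will show what dkms installed (assuming the dkms packaging, which puts modules under .../extra):

dkms status
find /lib/modules/$(uname -r)/extra -name "*.ko"

You should see spl.ko and zfs.ko (among others) for your running kernel.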

Load zfs

modprobe zfs

If you get a message that zfs isn't present, the modules failed to build during installation. Try rebooting.
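
To confirm the module is loaded:

lsmod | grep -w zfs
dmesg | grep ZFS

The second command should show a "ZFS: Loaded module" line with the version number.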

Configure systemd services to their presets

For some reason (as of 2016-10) the services don't get set to their presets during installation, so execute:

systemctl preset zfs-import-cache
systemctl preset zfs-mount
systemctl preset zfs-share
systemctl preset zfs-zed
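
To verify the presets took effect, ask systemd; each unit should report "enabled":

systemctl is-enabled zfs-import-cache zfs-mount zfs-share zfs-zed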

Prepare a target disk

If your target is a USB disk, you simply need to plug it in. Tail the log and note the new disk identifier:

journalctl -f

If you have a "real" disk, first look at your mounted partitions and take note of the names of your active disks. If you're using UUIDs, make a detailed list:

ls /dev/disk/by-uuid

The shortcuts will point to the /dev/sdn devices you're using now. Now you can shutdown, install the new disk, and reboot. You should find a new disk when you take a listing:

ls /dev/sd*

From now on, we'll assume you've found your new disk named "sdx".

If you got your target disk out of a junk drawer, the safest way to proceed is to zero the whole drive. This will get rid of ZFS and any RAID labels that might cause trouble. Bring up a terminal window, enter superuser mode, and run:

dd if=/dev/zero of=/dev/sdx bs=10M status=progress

This takes quite a while for large disks. If you know which partition on the old disk has a zfs pool, for example sdxn, you can speed things up by importing the pool and destroying it. But if the partition is already corrupted so the pool won't import properly, you can blast it by zeroing the first and last megabyte:

mysize=`blockdev --getsz /dev/sdxn`
dd if=/dev/zero of=/dev/sdxn bs=512 count=2048
dd if=/dev/zero of=/dev/sdxn bs=512 count=2048 seek=$((mysize - 2048))
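
For the import-and-destroy route mentioned above, a sketch (the pool name "oldpool" stands in for whatever zpool import reports):

zpool import               # scan for importable pools and note the name
zpool import -f oldpool    # force the import even if the pool was last used elsewhere
zpool destroy oldpool      # destroy it, which clears the zfs labels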

Partition your target device

I'm too lazy to create a step-by-step gdisk walk-through. If you need such a thing, you probably don't belong here. Run gdisk, parted, or whatever tool you prefer on /dev/sdx. Erase any existing partitions and create two new ones:

Partition  Code        Size        Type name        Purpose
        1  EF00     200 MiB        EFI System       EFI boot partition
        2  BF01  (rest of disk)    Solaris/Mac ZFS  ZFS pool

Write the new table to disk and exit. Tell the kernel about the new layout:

partprobe

If you are prompted to reboot, do so.
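
If you'd rather not drive gdisk interactively, here's a non-interactive sketch using sgdisk (from the gdisk package). It assumes /dev/sdx really is the target, and it is destructive:

sgdisk --zap-all /dev/sdx                        # wipe any existing partition tables
sgdisk -n1:0:+200M -t1:EF00 -c1:"EFI" /dev/sdx   # partition 1: EFI system partition
sgdisk -n2:0:0 -t2:BF01 -c2:"ZFS" /dev/sdx       # partition 2: rest of the disk for ZFS
partprobe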

Format the target EFI partition

mkfs.fat -F32 /dev/sdx1

Create a pool

zpool create pool -m none /dev/sdx2 -o ashift=12

By default, zfs will mount the pool and all descendant datasets automatically. We turn that feature off using "-m none". The ashift property specifies the physical sector size of the disk. That turns out to be a Big Ugly Deal, but you don't need to be concerned about it in a tutorial. I've added a section at the end of the article about sector sizes. You need to know this stuff if you're building a production system.

Configure performance options

zfs set compression=on pool
zfs set atime=off pool

Compression improves zfs performance unless your dataset contains mostly already-compressed files. Here we're setting the default value that will be used when creating descendant datasets. If you create a dataset for your music and movies, you can turn it off just for that dataset.
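
For example, a hypothetical dataset for already-compressed media might look like this (the name pool/media and its mount point are just illustrations):

zfs create pool/media -o compression=off -o mountpoint=/media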

The atime property controls file access time tracking. This hurts performance and most applications don't need it. But it's on by default.

Create the root dataset

zfs create -p pool/ROOT/fedora

Enable extended file attributes

zfs set xattr=sa pool/ROOT/fedora

This has the side effect of making ZFS significantly faster.

Specify a default bootfs for the pool

zpool set bootfs=pool/ROOT/fedora pool

If no root=ZFS=... clause appears on the kernel command line at boot time, zfs-dracut will search for a pool that has the bootfs property and use the value of that property as the root file system.

Provide some redundancy

If you think you might continue to use this one-device zfs pool, you can slightly improve your odds of survival by using the copies property: This creates extra copies of every block so you can recover from a "bit fatigue" event. A value of 2 will double the space required for your files:

zfs set copies=2 pool/ROOT/fedora

I use this property when booting off USB sticks because I don't trust the little devils.

At this point we'd expect to find two new symbolic links in here for the partitions we created on the target disk:

/dev/disk/by-uuid

The symbolic link to the EFI partition is always there pointing to /dev/sdx1, but the one for the ZFS partition pointing to /dev/sdx2 is not. I've tried re-running the rules:

udevadm trigger

But the link for /dev/sdx2 still isn't there. Regrettably, you have to reboot now. When you're back up and running, proceed...

Export the pool and re-import to an alternate mount point

zpool export pool
zpool import pool -d /dev/disk/by-uuid -o altroot=/sysroot

At this point don't panic if you look for /sysroot: It's not there because there are no specified mount points yet.

You will also notice the clause -d /dev/disk/by-uuid. This renames the disk(s) so their UUIDs appear when you execute zpool status. You could also use -d /dev/disk/by-id; details about this are covered later.

Note: It's possible to arrive at a state where a newly formatted disk will have an entry in some-but-not-all of the /dev/disk subdirectories. I'm not sure what causes this annoyance, but if the command to import the pool using UUIDs fails, try using /dev/disk/by-id instead. The important thing is to get away from using device names.

Specify the real mount point

zfs set mountpoint=/ pool/ROOT/fedora

Because we imported the pool with an altroot, /sysroot will now be mounted.

Copy root from the installer to the target

rsync -avxHASX / /sysroot/

Don't get careless about the trailing "/" character.

You'll see some errors near the end about "journal". That's ok.

Copy /boot from the installer to the target:

rsync -avxHASX /boot/ /sysroot/boot/

Understanding the rsync command

The nice incantation is from Rudd-O. This is what the options do:

a: Archive
v: Verbose
x: Only local filesystems
H: Preserve hard links
A: Preserve ACLs
S: Handle sparse files efficiently
X: Preserve extended attributes

Copy the EFI partition from the installer to the target

cd
mkdir here
mount /dev/sdx1 here
cp -a /boot/efi/EFI here
umount here
rmdir here

Get the UUID of the target EFI partition

blkid /dev/sdx1

In my case, it was "0FEE-5943"

Edit

/sysroot/etc/fstab

It should have only two lines:

pool/ROOT/fedora  /          zfs   defaults  0 1                          
UUID=0FEE-5943    /boot/efi  vfat  umask=0077,shortname=winnt 0 2

(Obviously, use your EFI partition's UUID.)

Chroot into the target

zenter /sysroot

Trigger the udev rules

udevadm trigger

We have to do this by hand this time because we added the rules without rebooting. The udev rules create symbolic links in the /dev directory expected by grub2-mkconfig. The reasons are explained at the end of this document.

Run zmogrify

zmogrify `uname -r`

Exit the chroot

exit
umount -lf /sysroot

Export the pool

zpool export pool

Reboot

If you're the adventurous type, simply reboot. Otherwise, first skip down to Checklist after update and before reboot and run through the tests. Then reboot.

If all is well, you'll be running on zfs. To find out, run:

mount | grep zfs

You should see your pool/ROOT/fedora mounted as root. If it doesn't work, please try the procedure outlined below in Recovering from boot failure.

Take a snapshot

Before disaster strikes:

zfs snapshot pool/ROOT/fedora@firstBoot

We are finished. Now for the ugly bits!


More about device names

When we created the pool, we used the old-style "sdx" device names. They are short, easy to type and remember. But they have a big drawback. Device names are associated with the ports on your motherboard, or sometimes just the order in which the hardware was detected. It would really be better to call them "dynamic connection names." If you removed all your disks and reconnected them to different ports, the mapping from device names to drives would change.

You might think that would play havoc when you try to re-import the pool. Actually, ZOL (ZFS on Linux) protects you by automatically switching the device names to UUIDs when device names in the pool conflict with active names in your system.

Linux provides several ways to name disks. The most useful of these are IDs and UUIDs. Both are long complex strings impossible to remember or type. I prefer to use IDs because they include the serial number of the drive. That number is also printed on the paper label. If you get a report that a disk is bad, you can find it by reading the label.

First, boot into your installer linux system.

Here's how to use IDs:

zpool import pool -o altroot=/sysroot -d /dev/disk/by-id

If you prefer UUIDs:

zpool import pool -o altroot=/sysroot -d /dev/disk/by-uuid

Should you ever want to switch back to device names:

zpool import pool -o altroot=/sysroot -d /dev

To switch between any two, first export and then import as shown above. The next time you export the pool or shut down, the new names will be preserved in the disk data structures.

Optimize performance by specifying sector size

ZFS will run a lot faster if you correctly specify the physical sector size of the disks used to build the pool. For this optimization to be effective, all the disks must have the same sector size. Discovering the physical sector size is difficult because disks lie. The history of this conundrum is too involved to go into here. For practical purposes, sector sizes are either 512 bytes or 4096 bytes.

First, ask the disk for its sector size:

lsblk -o NAME,PHY-SEC /dev/sdx

If it reports 4096, you can believe the answer. If it reports 512, there is doubt: many "Advanced Format" drives have 4096-byte physical sectors but report 512 for compatibility, so check the drive's documentation before trusting the answer.

When creating a pool, the sector size is specified as a power of 2. Since 512 bytes is the default, you only need to deal with the 4096 case. You do it using the ashift property:

zpool create pool /dev/sdx2 -m none -o ashift=12

Create more datasets

More complex dataset trees are possible and usual. Some reasons:

  1. To share content between operating systems.
  2. To apply special properties to system directories.
  3. To isolate user accounts and apply quotas.

Examples:

In the future, you may have multiple versions of Fedora and possibly other operating systems:

pool/ROOT/fedora24
pool/ROOT/fedora25
pool/ROOT/fedora26
...

Using this construct, all of them will share the same home directories:

zfs create pool/home -o mountpoint=/home

You could break it down further by having a dataset for each user.
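
A sketch of that idea with a hypothetical user and quota; it assumes pool/home was created with mountpoint=/home as above, so the child dataset lands at /home/alice:

zfs create pool/home/alice -o quota=100G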

For security reasons, it's a good idea to restrict what can go on in the /var tree. To do that:

zfs create pool/ROOT/fedora/var -o canmount=off -o setuid=off  -o exec=off

But some installers blow up if they don't have execute privilege in /var/tmp, so you have to create another:

zfs create pool/ROOT/fedora/var/tmp -o exec=on

Dissecting the /tmp directory into datasets with different properties is actually far more complex. This subject will take you into deep water rapidly. Search for advice and make up your own mind.

UPDATE: It is known that Fedora doesn't like to see /usr mounted on a separate dataset. It has to be part of the root file system or the system won't boot.

Similarly, it is possible to create and enable a swap partition as a dataset. This is complicated enough to deserve its own section:

Enable swapping

There is/was a certain amount of controversy about the safety of swapping on ZFS. I simply quote the advice given on the ZOL GitHub site:

zfs create -V 4G -b $(getconf PAGESIZE) \
    -o compression=zle \
    -o logbias=throughput \
    -o sync=always \
    -o primarycache=metadata \
    -o secondarycache=none \
    -o com.sun:auto-snapshot=false pool/swap

Note that "swap" isn't part of ROOT/fedora. That allows us to share the space with other linux installations. The "-V 4g" means this will be a 4G ZVOL - a fixed-size, non-zfs filesystem allocated from the pool. The reason for turning off data caching is that the operating system will manage swap like a cache anyway. The same reasoning and cache settings are used for ZVOLs created for virtual machine disk images.

After you're running with root on ZFS, complete swap setup by adding a line to /etc/fstab:

/dev/zvol/pool/swap none swap defaults 0 0

Format the swap dataset:

mkswap -f /dev/zvol/pool/swap

And enable swapping:

swapon -av

The article reminds us:

  1. Make sure the devices in the pool are named using UUIDs or IDs. (Not device names.)
  2. Don't enable hibernation: The memory can't be restored from swap in ZFS because the pool isn't accessible when the hardware tries to return from hibernation.

Deal with ZFS updates

Most of this article is about making kernel updates as painless with root on zfs as they are without. But when an update comes out for ZFS without a kernel update, there can be problems.

After performing a dnf update that includes a new ZFS, the dkms process will run and build new spl and zfs modules. Unfortunately, the installers don't run dracut, so the initramfs won't contain the new modules. You should still be able to boot, since the old initramfs is self-consistent. If you want everything to be up to date, run this before rebooting:

zmogrify `uname -r`

If the update group includes both zfs and kernel updates, there is a puzzle: if the kernel was installed last, the new initramfs will contain the new zfs modules because of our post-install script. But if zfs was installed last, things will be out of whack. To be sure:

zmogrify <your new kernel version>

This annoyance is the last frontier to making root on zfs transparent to updates. Without sounding peevish (I hope) I want to point out that there's a dkms.conf file in the zfs source package that contains the setting:

REMAKE_INITRD="no"

It would be nice if this option were changed to "yes": then zfs updates would be totally carefree, provided you've done all the other hacks described in this article. The issue was debated by the zfs-on-linux developers and they decided otherwise. Dis aliter visum.

Checklist after update and before reboot

Here's a list of steps you can take to help prevent boot problems. It won't hurt to run through them after any update that includes a new kernel or a new version of zfs.

To save typing, I'll assume your new kernel version is in the variable kver.

Create it like this (for example)

kver=4.7.7-200.fc24.x86_64

You also may need the version of zfs. For example:

zver=0.6.5.8

First, check to see if the modules for zfs and spl are present in:

/lib/modules/$kver/extra

If they aren't there, build them now:

dkms install -m spl -v $zver -k $kver
dkms install -m zfs -v $zver -k $kver

Next, check to see if there is an initramfs file for your new kernel:

ls /boot/initramfs-$kver.img

If not, you need to run dracut:

dracut -fv --kver $kver

Next, check to see if the zfs module is inside the new initramfs:

lsinitrd /boot/initramfs-$kver.img | grep "zfs.ko"

If it's not, run dracut:

dracut -fv --kver $kver

Check to see if your new kernel is mentioned in grub.cfg:

grep $kver /boot/efi/EFI/fedora/grub.cfg

If it's not, you need to build a new configuration file:

grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg

And make sure your new kernel is the default:

grubby --default-index

The command should return 0 (zero). If it doesn't:

grubby --set-default-index 0

If your system passes all these tests, it will probably boot into the new kernel with root on zfs.
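
If you'd rather not type the checks one at a time, here's a sketch of a script that runs them all and simply reports; it assumes the conventions above (kver holds the new kernel version, dkms modules live under .../extra, and /boot/efi is mounted) and leaves any fixes to you:

#!/bin/bash
# zcheck - report whether a new kernel looks ready to boot with root on zfs
# Usage: zcheck <kernel-version>

kver=$1

echo "== spl/zfs modules for $kver =="
find /lib/modules/$kver/extra -name "zfs.ko" -o -name "spl.ko"

echo "== initramfs present? =="
ls /boot/initramfs-$kver.img

echo "== zfs inside the initramfs? =="
lsinitrd /boot/initramfs-$kver.img | grep zfs.ko

echo "== new kernel mentioned in grub.cfg? =="
grep -c $kver /boot/efi/EFI/fedora/grub.cfg

echo "== default boot index (should be 0) =="
grubby --default-index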

Recovering from boot failure

There are three common problems after a kernel or zfs update:

  1. The system boots to the previous kernel, rather than the new one.
  2. The system hangs after a few lines into the boot process.
  3. Dracut complains about mounting the root.

You can usually avoid these horrors by doing the recommended checks. But accidents happen...

The system boots to the previous kernel

When a new kernel is installed on an EFI system, the grub2-mkconfig script is never run (but it should be). The installer script does run grubby, which may report:

grubby fatal error: unable to find suitable template

If you see this or any other fatal error message from grubby, don't reboot. Instead do the by-now familiar dance:

zmogrify <new kernel version>

You are now (hopefully) good to go.

If the grubby fatal error bothers you, you can prevent it by deleting the grub.cfg file before proceeding with a kernel update. You still, of course, have to build it again as shown above.

The system hangs after a few lines into the boot process

The last line you see will probably be:

Reached target Basic System

<DARKNESS>

This happens when the initramfs doesn't contain support for zfs. The solution is to "do the dance" but now you have to get into the broken system's root using your installer system.

Boot the installer device that has zfs support.

Import the problematic pool:

zpool import pool -o altroot=/sysroot

Chroot to the target system:

zenter /sysroot

Now do the deed:

zmogrify <new kernel version>

Back out of the chroot

exit
umount --recursive /sysroot

Export the pool and reboot:

zpool export pool
shutdown -r now

Dracut complains about mounting the root

You see something like this:

dracut: FATAL: Don't know how to handle 'root=ZFS=pool/ROOT/fedora'

This occurs because somehow your grub2-mkconfig ran with an unmodified 10_linux script. The line dracut complains about ought to work, but the current version of zfs-dracut plus various complex issues with grub don't allow it.

For a quick test, just interrupt the boot process, edit the menu selection, and remove the "root=" clause. Your system should boot normally. For a permanent fix:

  1. Carefully review your modifications to /etc/grub.d/10_linux.
  2. Run blkid to show the UUID of your EFI boot partition.
  3. Make sure that UUID is configured in /etc/fstab to mount /boot/efi.
  4. Run mount to confirm that /boot/efi is in fact mounted there.
  5. Re-run grub2-mkconfig.

Understanding the udev rules

If you ran grub2-mkconfig without special udev rules you'd get an error message like this:

failed to get canonical path of ‘/dev/<some big mess>’

There is an annoying defect in grub2-mkconfig: When looking for the pool devices, it simply prepends "/dev/" to each of the devices listed when you run "zpool status". Since those paths don't exist for devices named by UUIDs or IDs, you get the error message. To deal with this annoyance expediently, we could execute:

cd /dev
ln -s /dev/disk/by-id/<some big mess> <some big mess>

Or if you were using by-uuid:

cd /dev
ln -s /dev/disk/by-uuid/<some other big mess> <some other big mess>

This creates a symbolic link that points to another symbolic link that gets to the physical device and you can now run grub2-mkconfig. If you have a pool made from multiple devices, this would be a tedious process. And all those symbolic links will be gone after you reboot.

A better way is to create a udev rule that creates the links automatically. And that's what we did in a previous section.

If you look in the /dev directory after running udevadm trigger you'll see the new links expected by grub2-mkconfig.

Dealing with grub updates

Occasionally, you'll see updates that include the grub2 package itself and/or the grub2-efi-modules package. When this occurs, it's a good idea to check everything we've done that might be undone by such an update.

Recall that we modified the file:

/etc/grub.d/10_linux

Look in the /etc/grub.d directory. If you see a 10_linux.rpmsave file, it contains your old (modified) version of the file; in that case you need to apply the edits described in One last grub fix to the new 10_linux file. If, on the other hand, you find a 10_linux.rpmnew file, it is the new version from Fedora. Your old one is still in place, but you should retire it and use the new one by renaming 10_linux.rpmnew to simply 10_linux. Then apply the edits described in One last grub fix.

After modifying the 10_linux file, rebuild your grub.cfg file:

grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
grubby --set-default-index 0

Finally, if a new grub2-efi-modules package has been installed, copy the new zfs.mod file to the efi partition so you'll use the (possibly) new one:

cp -a /usr/lib/grub/x86_64-efi/zfs.mod /boot/efi/EFI/fedora/x86_64-efi

Rebuilding grubx64.efi

This idea might be interesting to those who want to build a new zfs-friendly Linux distribution.

You'll recall the step where we copied zfs.mod to the EFI partition. This is necessary because zfs is not one of the modules built into Fedora's grub2-efi rpm. Other Linux distributions support zfs without this hack so I decided to find out how it was done.

You can see the list of built-in modules by downloading the grub2-efi source and reading the spec file. It would be straightforward, but perhaps a bit tedious, to add "zfs" to the list of built-in modules, remake the rpm, and install it over the old one. An easier way is to rebuild one file found here:

/boot/efi/EFI/fedora/grubx64.efi

The trick is to get hold of the original list of modules. They need to be listed as one giant string without the ".mod" suffix. Here's the list from the current version of grub2.spec with "zfs" already appended:

all_video boot btrfs cat chain configfile echo efifwsetup efinet 
ext2 fat font gfxmenu gfxterm gzio halt hfsplus iso9660 jpeg loadenv 
loopback lvm mdraid09 mdraid1x minicmd normal part_apple part_msdos
part_gpt password_pbkdf2 png reboot search search_fs_uuid search_fs_file 
search_label serial sleep syslinuxcfg test tftp video xfs zfs

Copy the text block above to a temporary file "grub2_modules". Then execute:

grub2-mkimage \
    -O x86_64-efi \
    -p /boot/efi/EFI/fedora \
    -o /boot/efi/EFI/fedora/grubx64.efi \
    `xargs < grub2_modules`

The grub2.spec script does something like this when building the binary rpm.

Now you can delete the directory where we added zfs.mod:

rm -rf /boot/efi/EFI/fedora/x86_64-efi

Better? A matter of taste, I suppose. Rebuilding grubx64.efi is dangerous because it could be replaced by a Fedora update. I prefer using the x86_64-efi directory.

The package grub2-efi-modules installs a large module collection in:

/usr/lib/grub/x86_64-efi

The total size of all the .mod files there is about 3MB so you might wonder why not include all the modules? For reasons I don't have the patience to discover, at least two of them conflict and derail the boot process.

Pool property conflicts at boot time

It's possible for the zfs.mod that lives on the EFI partition to get out of sorts with the zfs.ko in the root file system. Or more precisely, the set of zfs properties enabled in the pool that contains the root file system may not be supported by zfs.mod if the pool was created by a more recent zfs.ko.

Right now (2016-10) I live in a brief era when the version of zfs.mod supplied by grub2-efi-modules agrees with the property set created by default using zfs 0.6.5.8. But such a happy situation cannot last. The pool itself retains the property set used when it was created. But when you create a new pool with a future version of zfs, it may not be mountable by an older zfs.mod.

The fix is to enumerate all the properties your zfs.mod supports and specify only those when creating a new pool. To discover the set of properties supported by a given version of zfs.mod, it appears that you have to study the source. Being a lazy person, I'll just quote a reference that shows an example of how a pool is created with a subset of available features.

The following quotation and code block come from the excellent article ArchLinux - ZFS in the section GRUB compatible pool creation.

"By default, zpool will enable all features on a pool. If /boot resides on ZFS and when using GRUB, you must only enable read-only, or non-read-only features supported by GRUB, otherwise GRUB will not be able to read the pool. As of GRUB 2.02.beta3, GRUB supports all features in ZFS-on-Linux 0.6.5.7. However, the Git master branch of ZoL contains one extra feature, large_dnodes that is not yet supported by GRUB."

zpool create -f -d \
    -o feature@async_destroy=enabled \
    -o feature@empty_bpobj=enabled \
    -o feature@lz4_compress=enabled \
    -o feature@spacemap_histogram=enabled \
    -o feature@enabled_txg=enabled \
    -o feature@hole_birth=enabled \
    -o feature@bookmarks=enabled \
    -o feature@filesystem_limits=enabled \
    -o feature@embedded_data=enabled \
    -o feature@large_blocks=enabled \
    <pool_name> <vdevs>

The article goes on to say: "This example line is only necessary if you are using the Git branch of ZoL."

Evidently I got away with ZFS 0.6.5.8 because it doesn't have the large_dnodes feature yet or grub got updated. To find out, run:

zpool get all pool | grep feature

For my pool, this shows:

pool  feature@async_destroy       enabled                     local
pool  feature@empty_bpobj         active                      local
pool  feature@lz4_compress        active                      local
pool  feature@spacemap_histogram  active                      local
pool  feature@enabled_txg         active                      local
pool  feature@hole_birth          active                      local
pool  feature@extensible_dataset  enabled                     local
pool  feature@embedded_data       active                      local
pool  feature@bookmarks           enabled                     local
pool  feature@filesystem_limits   enabled                     local
pool  feature@large_blocks        enabled                     local

No sign of large_dnodes yet, but be aware that it's out there waiting for you.

Limiting ARC memory usage

ZFS wants a lot of memory for its ARC cache. The recommended minimum is 1G per terabyte of storage in your pool(s). Without constraints, ZFS is likely to run off with all your memory and sell it at a pawn shop. To prevent this, you should specify a memory limit. This is done using a module parameter. A reasonable setting is half your total memory:

Edit or create:

/etc/modprobe.d/zfs.conf

This expression (for example) limits the ARC to 16G:

options zfs zfs_arc_max=17179869184

The size is given in bytes. Some convenient values:

16GB  = 17179869184
8GB   = 8589934592
4GB   = 4294967296
2GB   = 2147483648
1GB   = 1073741824
512MB = 536870912
256MB = 268435456

The modprobe.d parameter files are needed at boot time, so it's important to rebuild your initramfs after adding or changing this parameter:

dracut -fv --kver `uname -r`

And then reboot.
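
After the reboot, you can check that the limit took effect: zfs_arc_max is the parameter you set, and c_max in the ARC statistics is what the module is actually using:

cat /sys/module/zfs/parameters/zfs_arc_max
grep c_max /proc/spl/kstat/zfs/arcstats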

Running without ECC memory

To operate with ultimate safety and win the approval of ZFS zealots, you really ought to use ECC memory. ECC memory is typically only available on "server class" motherboards with Intel Xeon processors.

If you don't use ECC memory, you are taking a risk. Just how big a risk is the subject of considerable controversy and beyond the scope of these notes.

First, give yourself a fighting chance by testing your memory. Obtain a copy of this utility and run a 24-hour test:

http://www.memtest86.com

This is particularly important on a new server, because memory that has defects usually had them when it was purchased. If your memory passes the test, there is a good chance it will be reliable for some time.

To avoid ECC, you might be tempted to keep your server in a 1-meter thick lead vault buried 45 miles underground. It turns out that many of the radioactive sources for memory-damaging particles are already in the ceramics used to package integrated circuits. So this time-consuming measure is probably not worth the cost or effort.

Instead, we're going to enable the unsupported ZFS_DEBUG_MODIFY flag. This will mitigate, but not eliminate, the risk of using ordinary memory. It's supposed to make the zfs software do extra checksums on buffers before they're written. Or something like that. Information is scarce because people who operate without ECC memory are usually dragged down to the Bad Place.

Edit:

/etc/modprobe.d/zfs.conf

Add the line:

options zfs zfs_flags=0x10

As previously discussed, you need to rebuild initramfs after changing this parameter:

dracut -fv --kver `uname -r`

And then reboot. You can confirm the current value here:

cat /sys/module/zfs/parameters/zfs_flags

Complaints and suggestions

Share your woes by mail with Hugh Sparks. (I like to hear good news too.)

References