Friday, October 6, 2017

ZFS disk preparation

ZFS disk preparation is something of a legacy topic, as more and more newer systems support EFI (GPT) labeling for disks in the ZFS root pool. This post is therefore about legacy SMI (VTOC) labeled disks, not EFI (GPT) labeled ones. The old scheme may well still show up for years to come, so it's better to take some notes on it, just in case.

There is plenty of excellent information in the official documentation, good books, articles and posts, but it is still not as simple and straightforward as one would probably like. Frequently, one has to dig through a lot of material before finding the exact steps that fit the bill. Hence, I'll take some time to try to contribute a little toward fixing these shortcomings.

NOTE
Recall that the terms slice and partition always refer to the same thing on a SPARC platform, but not on an i86pc platform when dealing with the SMI (VTOC) scheme. On the latter platform, a slice is implicitly understood as a sub-partition, and a partition usually means an fdisk partition. Historically this was meant to ease multi-boot coexistence with other systems, a practice that is now fully deprecated. As good practice (and a sanity check), keep all slices on the same (primary) partition and rest in peace. With this good practice, most of the burden goes away and one can once more use the terms partition and slice interchangeably.

The goal is to set up an appropriate partition (slice) map so that a single mountable partition covers the whole disk, that is, its maximum usable area. This is important in order to get the most out of ZFS by letting it enable the disks' local caches and by getting rid of any alien coexistence. It therefore follows the bold recommendation to dedicate whole disks to ZFS.

For instance, let's say that the disk c8t1d0 is to be prepared in order to establish a mirrored root pool for a certain system (by the way, disk naming schemes are not part of this post). Let's also assume, at this moment and for the sake of simplicity, that the disk is already recognized by the system (usually said to be configured or available), hence listed by the format utility. Under these assumptions, one way (there are variations) to prepare it for the ZFS root pool is as follows:

# format
Searching for disks...done
 
AVAILABLE DISK SELECTIONS:
     0. c8t0d0 <...>
        /pci@0,0/pci8086,2829@1f,2/disk@0,0
     1. c8t1d0 <...>
        /pci@0,0/pci8086,2829@1f,2/disk@1,0
     ...
Specify disk (enter its number): 
1

 
Under an i86pc platform, if the chosen disk has been used before, it is better to start over: delete any existing partitions on it and then create a single SOLARIS2 partition. This can be achieved by choosing the fdisk subcommand, followed by option 3 (repeated as needed until no alien partition is left), then options 1 and 6. One should then get something similar to:
 
     Total disk size is ... cylinders
     Cylinder size is ... (512 byte) blocks

                                   Cylinders
Partition Status Type         Start  End Length  %
========= ====== ============ =====  === ====== ===

1         Active Solaris2         1  ...  ....  100

  
SELECT ONE OF THE FOLLOWING:
   1. Create a partition
   2. Specify the active partition
   3. Delete a partition
   4. Change between Solaris and Solaris2 Partition IDs
   5. Edit/View extended partitions
   6. Exit (update disk configuration and exit)
   7. Cancel (exit without updating disk configuration)
Enter Selection:
6


A simpler case occurs when no alien partitions were present to begin with, such as when the disk is brand new or has been cleaned up previously. When such a disk is first selected, the format command will display:

...
selecting c8t1d0
[disk formatted]
No Solaris fdisk partition found.
...

Under this condition the fdisk subcommand will report the same and ask what should be done. The answer should be y, which creates a single i86pc partition (here, not a slice) of type SOLARIS2 spanning the whole disk capacity:

No fdisk table exists.
The default partition for the disk is:

  a 100% "SOLARIS System" partition

Type "y" to accept the default partition, 
otherwise type "n" to edit the partition table.

y

 
Although not strictly required at this point, after either of the above cases it may be good practice to commit the changes made so far (creating the fdisk partition also creates a default slice map within it), as follows:
 
format> label
Ready to label disk, continue? yes


At this point, for both SPARC and i86pc platforms, it's necessary to define partition (slice) 0 (traditionally called the root partition back in the old days of UFS file systems) as covering the maximum available disk capacity. Fortunately, the same sequence of format subcommands covers both platforms (the example below is from an i86pc platform):
   
format> partition 
...


partition> modify
Select partitioning base:
        0. Current partition table (original)
        1. All Free Hog
Choose base (enter number) [0]?
1

Part     Tag Flag Cylinders      Size         Blocks
0       root wm   0             0      (0/0/0)          0
1       swap wu   0             0      (0/0/0)          0
2     backup wu   0 - 1020   1021.00MB (1021/0/0) 2091008
3 unassigned wm   0             0      (0/0/0)          0
4 unassigned wm   0             0      (0/0/0)          0
5 unassigned wm   0             0      (0/0/0)          0
6        usr wm   0             0      (0/0/0)          0
7 unassigned wm   0             0      (0/0/0)          0
8       boot wu   0 -    0      1.00MB (1/0/0)       2048
9 alternates wm   0             0      (0/0/0)          0

Do you wish to continue creating a new partition
table based on above table[yes]?
Free Hog partition[6]?
0
Enter size of partition '1' [0b, 0c, 0.00mb, 0.00gb]:
Enter size of partition '3' [0b, 0c, 0.00mb, 0.00gb]:
Enter size of partition '4' [0b, 0c, 0.00mb, 0.00gb]:
Enter size of partition '5' [0b, 0c, 0.00mb, 0.00gb]:
Enter size of partition '6' [0b, 0c, 0.00mb, 0.00gb]:
Enter size of partition '7' [0b, 0c, 0.00mb, 0.00gb]:

Part     Tag Flag Cylinders     Size            Blocks

0       root wm   1 - 1020  1020.00MB (1020/0/0) 2088960
1       swap wu   0            0      (0/0/0)          0
2     backup wu   0 - 1020  1021.00MB (1021/0/0) 2091008
3 unassigned wm   0            0      (0/0/0)          0
4 unassigned wm   0            0      (0/0/0)          0
5 unassigned wm   0            0      (0/0/0)          0
6        usr wm   0            0      (0/0/0)          0
7 unassigned wm   0            0      (0/0/0)          0
8       boot wu   0 -    0     1.00MB (1/0/0)       2048
9 alternates wm   0            0      (0/0/0)          0

Okay to make this the current partition table[yes]?
Enter table name (remember quotes):
"c8t1d0"

Ready to label disk, continue? y

partition> print
Current partition table (c8t1d0):
Total disk cylinders available: 1020 + 2 (reserved cylinders)

Part     Tag Flag Cylinders     Size         Blocks
0 unassigned wm   1 - 1019  1019.00MB (1019/0/0) 2086912
1 unassigned wm   0            0      (0/0/0)          0
2     backup wu   0 - 1019  1020.00MB (1020/0/0) 2088960
3 unassigned wm   0            0      (0/0/0)          0
4 unassigned wm   0            0      (0/0/0)          0
5 unassigned wm   0            0      (0/0/0)          0
6 unassigned wm   0            0      (0/0/0)          0
7 unassigned wm   0            0      (0/0/0)          0
8       boot wu   0 -    0     1.00MB (1/0/0)       2048
9 unassigned wm   0            0      (0/0/0)          0


partition>
label
Ready to label disk, continue? y
 
partition> quit
format> quit


NOTE
Just out of curiosity, the Flag column above can have 4 possible values:
  • wm : read-write, mountable
  • wu : read-write, unmountable
  • rm : read-only, mountable
  • ru : read-only, unmountable
One can also see that under i86pc, s8 is assigned 1 cylinder, cylinder 0, for holding some boot information. With this particular disk geometry, that cylinder takes 2048 blocks totaling 1 MB. Each block is therefore 512 bytes long, which suggests that a block here is the same thing as a sector, since that is the usual sector size on current disks.
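
The block-size arithmetic in the note above can be checked with a few lines of shell. This is only a sketch: the figures are copied from the partition tables shown earlier, and 1 MB is taken to mean 2^20 bytes, which is an assumption about how format reports sizes.

```shell
#!/bin/sh
# Figures copied from the partition tables above.
s8_blocks=2048        # size of slice 8 (one cylinder) in blocks
s2_blocks=2091008     # size of slice 2 (backup) in blocks
mb=1048576            # assumption: format's "MB" means 2^20 bytes

# s8 is reported as 1.00MB, so one block must be 1 MB / 2048.
bytes_per_block=$(( mb / s8_blocks ))
echo "bytes per block: $bytes_per_block"      # prints 512

# Cross-check against the size format reports for slice 2 (1021.00MB).
s2_mb=$(( s2_blocks * bytes_per_block / mb ))
echo "slice 2 size: ${s2_mb} MB"              # prints 1021 MB
```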
Invoke the format command once more to double-check the results. In my case, doing so once produced the following message:

Note: detected additional allowable expansion storage space
that can be added to current SMI label's computed capacity.
Select to adjust the label capacity.


Following the above instructions, another message appeared (and I noticed that s0 had not been updated accordingly):

Expansion of label cannot be undone; continue (y/n) ? y
The expanded capacity was added to the disk label and "s2".
Disk label was written to disk. 


So, to close the loop, I repeated all the previously shown steps from the modify subcommand onward, and all has seemed fine since then. This time I "gained" just 1 MB, but who knows?

NOTE
It's noticeable that partition 2 (slice s2) overlaps with partition 0 (slice s0). This is not an issue, but it forces the use of the -f flag with the zpool attach command. It should be said that this slice must be kept to allow an eventual (re)installation of the boot loader (at least under the i86pc platform).
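
As an illustration of the -f flag mentioned in the note, the sketch below only prints the attach command rather than running it. The pool name rpool and the assumption that c8t0d0s0 is the device currently in the root pool are mine, not taken from the steps above; adjust both to the actual system.

```shell
#!/bin/sh
# Assumptions (hypothetical, not from the post): the root pool is named
# "rpool" and its current device is slice 0 of c8t0d0.
POOL=${POOL:-rpool}
CURRENT=${CURRENT:-c8t0d0s0}
NEW=${NEW:-c8t1d0s0}      # slice 0 of the disk prepared above

attach_cmd() {
    # -f is required because s0 overlaps the backup slice s2.
    # Usage: attach_cmd <pool> <existing-device> <new-device>
    echo "zpool attach -f $1 $2 $3"
}

# Print the command for review; run it by hand once it looks right.
attach_cmd "$POOL" "$CURRENT" "$NEW"
```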
Now that things are presumably clearer, the procedure can be seen as simple after all, which in fact it is. But one issue remains: it's not automated, and is thus inefficient or impractical if many systems are to be prepared in a row, such as on a not-so-small virtualization or cloud infrastructure. Fortunately the procedure can be streamlined, as long as all the disks have equal geometry (which makes sense for deployments in large batches). What needs to be done is:
  1. Do the manual process once and then save the result as a template:
    # prtvtoc /dev/rdsk/c8t1d0s2 >/tmp/vtoc-template
     
  2. Repeatedly apply the template, preferably via some ordinary scripting strategy:
    # fmthard -s /tmp/vtoc-template /dev/rdsk/c?[t?]d?s2 
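
The second step could be scripted along the following lines. This is a minimal sketch, not a definitive tool: the DRY_RUN switch and the example disk names are hypothetical, and the script does not verify that the target disks share the template's geometry, so that condition remains the operator's responsibility.

```shell
#!/bin/sh
# Sketch: apply a saved VTOC template to several disks of equal geometry.
# TEMPLATE matches the path used in step 1. DRY_RUN is a hypothetical
# safety switch introduced here (1 = only print, 0 = really run fmthard).
TEMPLATE=${TEMPLATE:-/tmp/vtoc-template}
DRY_RUN=${DRY_RUN:-1}

apply_template() {
    # $1 is a bare disk name such as c8t2d0; s2 is appended as in step 2.
    cmd="fmthard -s $TEMPLATE /dev/rdsk/${1}s2"
    if [ "$DRY_RUN" = 1 ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

# Disk names are passed as arguments; with none given, nothing happens.
for disk in "$@"; do
    apply_template "$disk"
done
```

Saved under a name of one's choosing and invoked with, say, c8t2d0 c8t3d0 as arguments (both made-up names), it prints one fmthard line per disk. Setting DRY_RUN=0 executes them instead, after which prtvtoc can be used on each disk to confirm the result.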
       
And that's it for preparing SMI (VTOC) labeled disks for ZFS.