Wednesday, October 25, 2017

Solaris 11 Express awesome update

Solaris 11 Express was a transitional version from Solaris 10 to Solaris 11, following OpenSolaris 2009.06. Since it became publicly available, everybody could already see how promising Solaris 11 would be, in many ways, in part because of the OpenSolaris community. Nowadays, one can better recognize what has been achieved up to Solaris 11.3, which, by the time of this writing, has been around for about 3 years. There's also an early announcement that Solaris 11.4 will be available by mid-2018, and according to the planned EoLNs (end-of-life notices), chances are that many desktop features will be further trimmed out, although GCC should finally be upgraded, hopefully to a version greater than 7; let's see.

NOTE
Although by now Solaris 11 Express is officially obsolete and OpenSolaris has been left behind by Oracle, there are still community efforts to reestablish open-source variants, such as OpenIndiana and SmartOS, both running over a lagging-behind kernel called Illumos, based on the original open-source version of SunOS 5.11. But unfortunately, that kernel isn't nearly as modernized and optimized as Oracle's current closed-source product.
But in spite of all that, Solaris 11 Express' major benefit was to incorporate early many Solaris advancements over Solaris 10 and still run on a legacy 32-bit platform! Yes! This is key, because despite the official business strategies and propaganda focusing on 64-bit mid-range and high-end big-iron, the truth is that there's a lot of legacy hardware that can still be put to good service for the crowds in the 3rd world, which are striving to evolve and do not count on a lot of money or other powerful, current resources.

The initial GA release of Solaris 11 Express didn't perform well, perhaps due to a lot of debugging hooks (code assertions) and conservative strategies; after all, it was a key transitional milestone of Solaris. But the fact is that those who had paid for a support contract could benefit from regular updates, called SRUs (service release updates), which fixed many issues and greatly improved system performance, including booting speed. By the last general SRU, SRU-13, things were noticeably better.

For instance, Solaris 11 Express SRU-13 rivals the speed of Solaris 11.3 GA and certainly runs faster than OpenIndiana 2017.04. In my opinion, a relative comparison among "recent" Solaris distros could be depicted by the following table:


But wait! Things could become even better, because Engineered Systems for high-performance grid-computing started to see the light of day and some of them were to be driven by Solaris 11 Express! This transitional version of Solaris then became so acclaimed and accredited that it deserved an additional and special SRU update targeted at Exalogic, the SRU-14. The SRU-14 could not be applied to ordinary systems because it had a special dependency associated with the Exalogic Engineered System. Of course, there should be good reasons for such a constraint. But the fact is, or at least seems to be, that in general it runs amazingly well on ordinary systems too!
To enjoy all the power of SRU-14 on ordinary systems, some homework is necessary in order to lift the impeding constraint embedded in the update.

NOTE
I'll assume that a support repository has already been made available by means of procedures I've visited in the past, such as the IPS repository update post.
For instance, consider the following local support repository:

# zfs list -o mountpoint -H -r /depot
/depot
/depot/solaris
/depot/solaris/11e
/depot/solaris/11e/release
/depot/solaris/11e/sru-13
/depot/solaris/11e/sru-14


At first, a usual update attempt from SRU-13 to SRU-14 fails:

# pkg update --be-name solaris-11e-sru-14
Creating Plan ...
pkg update: No solution was found to satisfy constraints
Plan Creation: Package solver has not found a solution
               to update to latest available versions.
               This may indicate an overly constrained
               set of packages are installed.

latest incorporations:

  pkg://solaris/consolidation/gnome/gnome-incorporation@...151.0.1.14...
  pkg://solaris/consolidation/sfw/sfw-incorporation@...151.0.1.14...
  pkg://solaris/consolidation/osnet/osnet-incorporation@...151.0.1.14...
  pkg://solaris/entire@...151.0.1.14...

The following indicates why the system cannot update to the latest version:

    Reject:  pkg://solaris/entire@...151.0.1.14...
    Reason:  A version for 'require-any' dependency on
             pkg:/system/platform/exalogic/firstrun cannot be found

From the diagnostic messages above it's possible to realize that the SRU-14 was crafted to be applied as part of an automated installation of Solaris 11 Express targeted at the Exalogic Engineered System. The only constraint was a missing IPS package delivering a one-time-run SMF service performing initial configurations for Exalogic:
pkg:/system/platform/exalogic/firstrun.
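For reference, a require-any dependency in an IPS package manifest looks roughly like the following (a hypothetical sketch of the kind of action carried by the entire incorporation, not its actual contents):

depend type=require-any \
  fmri=pkg:/system/platform/exalogic/firstrun \
  fmri=pkg:/some/alternative/package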
NOTE
It's noticeable that in more recent releases the description of the technique for tailoring one-time-run IPS packages has evolved, while it was completely lacking for Solaris 11 Express. Nevertheless, the simpler and more straightforward instructions found in the Solaris 11/11 Information Library are enough to work perfectly under Solaris 11 Express. Despite that evolution, the documentation still lacks a lot of clarity by sticking to do-it-this-way recipes instead of building knowledge, unfortunately. For this post I'll stay as much as possible with the clearer and simpler procedures, which seem to still work equally well in terms of backward compatibility.
The major steps in creating the missing package are:
  1. Creating a SMF service manifest for a dummy service.
  2. Deploying the special IPS package unlocking SRU-14.
The above steps do nothing more than complete the set of requirements that unlock the SRU-14 installation. I'm not sure if I could create an "empty" package, that is, maybe the dummy SMF service is unneeded after all. Anyway, the difficulties lie just in the intrinsics of these steps themselves, not in the big picture. Let's visit each of the above steps in more detail:

1. Creating a SMF service manifest for a dummy service.

In recent versions of Solaris, this has been somewhat simplified by svcbundle(1M), but I won't rely on it at this moment. I prefer to know all the details and stay in control as much as possible.

# mkdir /tmp/sru-14-unlock
# cd !!$

# cat sru-14-unlock.xml
...
<?xml version="1.0" ?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">

<service_bundle name="sru-14-unlock" type="manifest">

  <service name="sru-14-unlock" type="service" version="1">

    <create_default_instance enabled="false"/>
    <single_instance/>

    <dependency name="multi_user" type="service" grouping="require_all" restart_on="none">
      <service_fmri value="svc:/milestone/multi-user:default"/>
    </dependency>

    <exec_method name="start"   type="method" exec=":true" timeout_seconds="60"/>
    <exec_method name="stop"    type="method" exec=":true" timeout_seconds="60"/>
    <exec_method name="refresh" type="method" exec=":true" timeout_seconds="60"/>


    <!-- must be defined at this exact place -->
    <property_group name="startd" type="framework">
      <propval name="duration" type="astring" value="transient"/>
    </property_group>

  </service>


</service_bundle>

# svccfg validate !!$
...

2. Deploying the special IPS package unlocking SRU-14.

This amounts to the creation and installation of the missing package.
What matters most is the package name, given by the pkg.fmri value below.

# pwd
/tmp/sru-14-unlock

# mkdir -p ./prototype/lib/svc/manifest/site
# cp sru-14-unlock.xml !!$
...

# cat sru-14-unlock.p5m
set \
  name=pkg.fmri \
  value=system/platform/exalogic/firstrun@1.0,5.11

set \
  name=pkg.summary \
  value="SRU-14 unlock."

set \
  name=pkg.description \
  value="Dummy package to unlock SRU-14 installation."

set \
  name=org.opensolaris.smf.fmri \
  value=svc:/sru-14-unlock

set \
  name=org.opensolaris.consolidation \
  value=userland

set \
  name=info.classification \
  value="org.opensolaris.category.2008:System/Packaging"
 

file \
  path=lib/svc/manifest/site/sru-14-unlock.xml \
  mode=0444 owner=root group=sys

NOTE
The extension .p5m most probably means package v5 manifest.
# pkglint !!$
Lint engine setup...
Starting lint run...

 
This package must be placed into a package repository, from which it can subsequently be installed. If the installation is to be part of an automated install (AI), then the repository must be created in a location accessible to AI clients during the first boot. In this post I'm not using AI, so I'll just create the repository under /tmp, which suffices for a one-time interactive install.

# pwd
/tmp/sru-14-unlock

# pkgrepo create ./repo
# pkgrepo add-publisher -s ./repo solaris

# pkgsend publish -d ./prototype -s ./repo sru-14-unlock.p5m
pkg://solaris/system/platform/exalogic/firstrun@1.0,5.11:...Z
PUBLISHED


# pkg list -af -g ./repo
NAME (PUBLISHER)                     VERSION    IFO
system/platform/exalogic/firstrun    1.0        ---


# pkg info -g ./repo firstrun
       Name: system/platform/exalogic/firstrun
    Summary: SRU-14 unlock.
Description: This dummy package...
      State: Not installed
  Publisher: solaris
    Version: 1.0
... Release: 5.11
     Branch: None
   ... Date: ...
       Size: 928.00 B
       FMRI: pkg://solaris/system/platform/exalogic/firstrun...


# pkg install -g ./repo -nv firstrun
           Packages to install:        1
     Estimated space available:   ... GB
Estimated space to be consumed: 14.31 MB
       Create boot environment:       No
Create backup boot environment:       No
          Rebuild boot archive:       No

Changed packages:
solaris
  system/platform/exalogic/firstrun
    None -> 1.0,5.11:...


# pkg install -g ./repo -v firstrun 
...
DOWNLOAD               PKGS       FILES    XFER (MB)
Completed               1/1         1/1      0.0/0.0

PHASE                                        ACTIONS
Install Phase                                    7/7

PHASE                                          ITEMS
Package State Update Phase                       1/1
Image State Update Phase                         2/2

PHASE                                          ITEMS
Reading Existing Index                           8/8
Indexing Packages                                1/1


# pkg info firstrun
       Name: system/platform/exalogic/firstrun
    Summary: SRU-14 unlock.
Description: Dummy package to unlock SRU-14 installation.
   Category: System/Packaging
      State: Installed
  Publisher: solaris
    Version: 1.0
...


# svcadm restart manifest-import

On the console one sees:
Loading smf(5) service descriptions: 1/1

# svcs -a |grep sru-14
disabled       19:36:52 svc:/sru-14-unlock:default


And voilà!

For a default text-installation of Solaris 11 Express with SRU-13 one gets:

# pkg update -nv --be-name solaris-11e-sru-14
            Packages to update:        15 
     Estimated space available:    ... GB
Estimated space to be consumed: 366.68 MB
       Create boot environment:       Yes
     Activate boot environment:       Yes
Create backup boot environment:        No 
          Rebuild boot archive:       Yes

Changed packages:

solaris
  SUNWcs
    0.5.11,5.11-0.151.0.1.13:... -> 0.5.11,5.11-0.151.0.1.14:...
  consolidation/gnome/gnome-incorporation
    0.5.11,5.11-0.151.0.1.13:... -> 0.5.11,5.11-0.151.0.1.14:...
  consolidation/osnet/osnet-incorporation
    0.5.11,5.11-0.151.0.1.13:... -> 0.5.11,5.11-0.151.0.1.14:...
  consolidation/sfw/sfw-incorporation
    0.5.11,5.11-0.151.0.1.13:... -> 0.5.11,5.11-0.151.0.1.14:...
  database/sqlite-3
    3.6.23,5.11-0.151.0.1.4:...  ->  3.7.5,5.11-0.151.0.1.14:...
  entire
    0.5.11,5.11-0.151.0.1.13:... -> 0.5.11,5.11-0.151.0.1.14:...
  image/library/libpng
    0.5.11,5.11-0.151.0.1:...    -> 0.5.11,5.11-0.151.0.1.14:...
  library/desktop/gtk2
    0.5.11,5.11-0.151.0.1:...    -> 0.5.11,5.11-0.151.0.1.14:...
  library/libtasn1
    0.5.11,5.11-0.151.0.1:...    -> 0.5.11,5.11-0.151.0.1.14:...
  runtime/python-26
    2.6.4,5.11-0.151.0.1:...     ->  2.6.4,5.11-0.151.0.1.14:...
  system/file-system/zfs
    0.5.11,5.11-0.151.0.1.11:... -> 0.5.11,5.11-0.151.0.1.14:...
  system/kernel
    0.5.11,5.11-0.151.0.1.13:... -> 0.5.11,5.11-0.151.0.1.14:...
  system/kernel/platform
    0.5.11,5.11-0.151.0.1.12:... -> 0.5.11,5.11-0.151.0.1.14:...
  system/library
    0.5.11,5.11-0.151.0.1.13:... -> 0.5.11,5.11-0.151.0.1.14:...
  system/network/nis
    0.5.11,5.11-0.151.0.1.8:...  -> 0.5.11,5.11-0.151.0.1.14:...


# pkg update --be-name solaris-11e-sru-14
            Packages to update:  15
       Create boot environment: Yes
Create backup boot environment:  No

DOWNLOAD               PKGS       FILES    XFER (MB)
Completed             15/15   1743/1743    41.8/41.8

PHASE                                        ACTIONS
Removal Phase                                  58/58
Install Phase                                  52/52
Update Phase                               4258/4258

PHASE                                          ITEMS
Package State Update Phase                     30/30
Package Cache Update Phase                     15/15
Image State Update Phase                         2/2

PHASE                                          ITEMS
Reading Existing Index                           8/8
Indexing Packages                              15/15

A clone of ... exists and has been updated and activated.
On the next boot the Boot Environment solaris-11e-sru-14 will be
mounted on '/'.  Reboot when ready to switch to this updated BE.

----------------------------------------------------------
NOTE: Please review release notes posted at:

http://www.oracle.com/pls/topic/lookup?ctx=E23824&id=SERNS
----------------------------------------------------------


# init 6

And this concludes this post.
 

Wednesday, October 11, 2017

ANSI escape sequences - colors

ANSI escape sequences for color manipulation consist not only of text color codes, but also of convenient combinations with text attribute codes. Some ANSI codes related to them are:
 
Text attributes
0  All attributes OFF
1  Bold
2  Dimmed             (odd with magenta)
4  Underscore         (monochrome displays only)
7  Reversed           (foreground/background)
8  Hidden             (concealed)
9  Strike-through
Text Colors
        Foreground Background
Black           30         40
Red             31         41
Green           32         42
Yellow          33         43
Blue            34         44
Magenta         35         45
Cyan            36         46
White           37         47
The escape sequence, as a whole, begins with the escape character, ASCII 27 (0x1B), followed by the [ character, followed by a semicolon-separated list of ANSI codes as per the above tables, finally ending with a lowercase m. That is:
Esc[<code-1>;...;<code-n>m
But depending on where it's used, the initial escape character (Esc) or the sequence itself is denoted in particular ways. For instance, even among various components of Bash it varies as follows:

  \[\e[ANSI-codes-list\] - In PS1 and PS2 prompts definition
    \e[ANSI-codes-list   - In .inputrc definitions
    ^[[ANSI-codes-list   - On scripts (^V+Esc in VIM generates ^[)
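
A quick way to try a sequence directly in a shell, before wiring it into a prompt, is printf with octal escapes (a minimal sketch; the message is arbitrary):

  printf '\033[1;33m%s\033[0m\n' "bold yellow text, then attributes off"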

As a real example, see the following excerpt from a PS1 definition:
(an expanded example is found at Shell initialization files)

  O='\[\e[0m\]'        # all off
  B='\[\e[0;1m\]'      # bright / bold
  Y='\[\e[0;33m\]'     # yellow / orange
 

  PS1="$O\n"
  PS1="$PS1"$([[ "$LOGNAME" != root ]] && echo "$Y\u$B@")
  PS1="$PS1$Y\h$B:$O\w\n$B\\\$$O

  export PS1

It will produce something similar to:
                                              
user1@host:/tmp                               
$ _                                           

Many interesting things can be done with ANSI escape sequences.
There are also some more advanced features, such as:
  • Cursor positioning
  • Clearing the screen, the line and so on...
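
Just to give a taste of those, some standard sequences are (a brief sketch, not an exhaustive list):

  printf '\033[2J\033[H'    # clear the screen and move the cursor home
  printf '\033[K'           # clear from the cursor to the end of the line
  printf '\033[10;5H'       # move the cursor to row 10, column 5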
 

Tuesday, October 10, 2017

Disk naming

This topic may be technically verbose due to its nature.
In the past this knowledge was much more common and relevant.
Nowadays new technologies such as ZFS have greatly simplified matters.

Anyway, when dealing with local hardware (interfaces, buses, controllers, disks) and even remote storage (SAN), it's inevitably necessary to possess at least a basic understanding of the overall scheme adopted by Solaris when abstracting these kinds of peripheral devices and remote storage resources.

I've always found this topic somewhat difficult to understand because no-one had enough patience or a sufficiently broad understanding at the time to present me with a reasonable big picture of things. That's why I'm now attempting to fill this gap by writing down my current knowledge on this matter and hopefully better paving the way for others interested in this subject too.

To summarize, all or most of these details get reflected in the disk naming schemes adopted by the system, which one perceives while dealing with disk devices for ZFS storage pools. The most elementary example comes from the format command. For instance:

# echo |format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c1t0d0 ...
          /pci@0,0/pci8086,2829@d/disk@0,0
       1. c1t1d0 ...
          /pci@0,0/pci8086,2829@d/disk@1,0
       2. c1t2d0 ...
          /pci@0,0/pci8086,2829@d/disk@2,0
       3. c1t3d0 ...
          /pci@0,0/pci8086,2829@d/disk@3,0
       4. c1t4d0 ...
          /pci@0,0/pci8086,2829@d/disk@4,0
       5. c4t3d0 ...
          /pci@0,0/pci1000,8000@14/sd@3,0
       6. c5t7d0 ...
          /pci@0,0/pci1000,8000@16/sd@7,0


At first sight this information may seem quite complex!
  • What's the meaning of c1t0d0 ?
  • Worse, what the heck is /pci@0,0/pci8086,2829@d/disk@0,0 ?

Well, to start:
 
  • The first is a logical (disk) name.
    A system-wide convenient disk name simplification.
    Actually, a symbolic link at /dev/rdsk mapping into /devices.
       
  • The second is a physical (device) name.
    A file path abstraction associated to a hardware component.
    Such paths are rooted at the /devices special file-system.

LOGICAL NAMES

Let's first dig into the simpler logical names.
Logical names obey the fairly simple format:
c?[t?]d?[s?|p?]
Where each portion identifies the following:
c - controller #  (logical)
    Some identification assigned by the system.
    Not necessarily related to any physical ordering.

t - bus target # (physical)
    Optional: present only on bus-oriented controllers.
    Not present for direct controllers (which don't have a bus).
    Important to SCSI, iSCSI, SAS and SATA.

d - LUN # (physical)

    Always 0, except when many LUNs exist on a single bus.
    For SCSI, SAS and SATA, always 0.

s - slice #
    Usually from 0 thru 6. Optional. Mutually exclusive with p.
    Slice 2 (s2) "conventionally" means the whole disk.
    Legacy; important for old SMI (VTOC) disks.

p - fdisk partition # (i86pc only)
    Ranging from 0 thru 4. Optional. Mutually exclusive with s.
    Partition 0 (p0) means the whole disk.
    Legacy; used with PCFS on some USB devices.
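
As a quick illustration, take the c4t3d0 SCSI disk from the format listing above: controller 4, target 3, LUN 0. With slice 2 appended, its logical name is just a symbolic link into /devices (output abridged; the exact minor suffix is illustrative):

# ls -l /dev/dsk/c4t3d0s2
l...  ... /dev/dsk/c4t3d0s2
          -> ../../devices/pci@0,0/pci1000,8000@14/sd@3,0:c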
 
PHYSICAL NAMES

A physical name is much more difficult to understand, not necessarily with respect to its format, but with respect to its relationship to system internals.

One can start by looking up the logical name:

# l /dev/rdsk/c1t0d0
l...  ... /dev/rdsk/c1t0d0 

       -> /devices/pci@0,0/pci8086,2829@d/disk@0,0:wd,raw

Looking at the physical name, in reverse order, one sees:
  1. The raw suffix.
    Any reference to this entry means using the device as a character device, in other words, in a byte-by-byte mode. This raw view of the device is depicted by the letter r in the rdsk component of the device path name.
      
  2. The wd suffix.
    I deduce this is an abbreviation of whole-disk.
    This deduction comes from the following straightforward observation:
     
    # prtconf -v |ggrep -B2 /dev/rdsk/c1t0d0 \
                 |grep -v spectype \
                 |tr -s ' '

     dev_path=/pci@0,0/pci8086,2829@d/disk@0,0:a,raw
     dev_link=/dev/rdsk/c1t0d0s0
    --
     ...
    --
     dev_path=/pci@0,0/pci8086,2829@d/disk@0,0:g,raw
     dev_link=/dev/rdsk/c1t0d0s6
    --
     dev_path=/pci@0,0/pci8086,2829@d/disk@0,0:q,raw
     dev_link=/dev/rdsk/c1t0d0p0
    --
     ...
    --
     dev_path=/pci@0,0/pci8086,2829@d/disk@0,0:u,raw
     dev_link=/dev/rdsk/c1t0d0p4
    --
     dev_path=/pci@0,0/pci8086,2829@d/disk@0,0:wd,raw
     dev_link=/dev/rdsk/c1t0d0
     
       
  3. A series of device path components denoting hardware nodes.
    The general form is: hardware-node@address.
    In the example:
     
      disk@0,0
      pci8086,2829@d
      pci@0,0
     

    The disk component is the most important one.
    Its general form is: disk@bus-target-#,LUN-#.
     
    Furthermore, the whole path is associated to a device-driver:
     
    # grep /pci@0,0/pci8086,2829@d/disk@0,0 /etc/path_to_inst
    "/pci@0,0/pci8086,2829@d/disk@0,0" 0 "sd"


    The device-driver name could also have been found by inspecting the entry's major number (driver), 214, present in the special devfs file-system mounted at /devices:
     
    # l /devices/pci\@0\,0/pci8086\,2829\@d/disk\@0\,0:wd,raw
    c...  214,  7  ...  /.../pci@0,0/pci8086,2829@d/disk...

     
    # grep 214 /etc/name_to_major |cut -d' ' -f1
    sd

    Just for completeness, in the example above, note that 7, to the right of 214, is the minor-number, denoting the driver instance unique id for that specific device.

    The controller's assigned # is undocumented, hence it's nearly impossible to fully understand much about it. With 4 SATA, 1 SCSI and 1 SAS disk, one gets:

    # prtconf -v |ggrep -B5 /dev/cfg \
                 |ggrep -E -v 'spectype|\(' \
                 |tr -s ' '

     dev_path=/pci@0,0/pci8086,2829@d:devctl
     dev_path=/pci@0,0/pci8086,2829@d:0
     dev_link=/dev/cfg/sata0/0
     dev_path=/pci@0,0/pci8086,2829@d:1
     dev_link=/dev/cfg/sata0/1
     dev_path=/pci@0,0/pci8086,2829@d:2
     dev_link=/dev/cfg/sata0/2
     dev_path=/pci@0,0/pci8086,2829@d:3
     dev_link=/dev/cfg/sata0/3
     dev_path=/pci@0,0/pci8086,2829@d:4
     dev_link=/dev/cfg/sata0/4
    --
     dev_path=/pci@0,0/pci1000,8000@14:devctl
     dev_path=/pci@0,0/pci1000,8000@14:scsi
     dev_link=/dev/cfg/c4
    --
     dev_path=/pci@0,0/pci1000,8000@16:devctl
     dev_path=/pci@0,0/pci1000,8000@16:scsi
     dev_link=/dev/cfg/c5
That's a reasonable kick-off, isn't it?
   

Friday, October 6, 2017

ZFS disk preparation

ZFS disk preparation is kind of a legacy topic, as more and more newer systems support EFI (GPT) labeling for disks in the ZFS root pool. So this post is about legacy SMI (VTOC) labeled disks, not EFI (GPT) labeled disks. In general, the old scheme may still appear in the years to come, so better take some notes on it, just in case.

There's lots and lots of good information in the official documentation, good books, articles and posts, but still not as simple and straightforward as one would probably desire. Frequently, one has to dig into lots of information until finding the exact steps that fit the bill. Hence, I'll take some time trying to add a little more contribution toward fixing these shortcomings.

NOTE
I need to recall that the terms slice and partition always refer to the same thing under a SPARC platform, but not so under an i86pc platform when dealing with the SMI (VTOC) scheme. On the latter platform, a slice is implicitly understood as a sub-partition and a partition is usually spoken of as an fdisk partition. History tells this was so to help multi-boot coexistence with other systems, which is fully deprecated. As good practice (and sanity check), keep all slices on the same (primary) partition and rest in peace. With this good practice, most of the burden goes away and one can once more use the terms partition and slice interchangeably.

The goal is to set up an appropriate partition (slice) map in order to assure that a single mountable partition covers the whole disk, its maximum usable area. This is important in order to get the most out of ZFS, by letting it enable the disks' local caches and getting rid of any alien coexistence. Therefore, this addresses the bold recommendation to dedicate whole disks to ZFS.

For instance, let's say that the disk c8t1d0 is to be prepared in order to establish a mirrored root pool for a certain system (by the way, disk naming schemes are not part of this post). Let's also assume, at this moment and for the sake of simplicity, that the disk is already recognized by the system (usually said to be configured or available), hence listed by the format utility. Under these assumptions, one way (there are variations) to prepare it for the ZFS root pool is as follows:

# format
Searching for disks...done
 
AVAILABLE DISK SELECTIONS:
     0. c8t0d0 <...>
        /pci@0,0/pci8086,2829@1f,2/disk@0,0
     1. c8t1d0 <...>
        /pci@0,0/pci8086,2829@1f,2/disk@1,0
     ...
Specify disk (enter its number): 
1

 
Under an i86pc platform, if the chosen disk has been previously used, it's better to re-partition it: begin by deleting any current partitions on it and then create a single SOLARIS2 partition. This can be achieved by choosing the fdisk subcommand, followed by options 3 (as needed, until no alien partition is left), 1 and 6. Then one should get something similar to:
 
     Total disk size is ... cylinders
     Cylinder size is ... (512 byte) blocks

                                   Cylinders
Partition Status Type         Start  End Length  %
========= ====== ============ =====  === ====== ===

1         Active Solaris2         1  ...  ....  100

  
SELECT ONE OF THE FOLLOWING:
   1. Create a partition
   2. Specify the active partition
   3. Delete a partition
   4. Change between Solaris and Solaris2 Partition IDs
   5. Edit/View extended partitions
   6. Exit (update disk configuration and exit)
   7. Cancel (exit without updating disk configuration)
Enter Selection:
6


A simpler case happens if originally no alien partitions were present, such as when the disk is brand new or has been cleaned up previously. When initially choosing such a disk the format command will display:

...
selecting c8t1d0
[disk formatted]
No Solaris fdisk partition found.
...

Under this condition the fdisk subcommand will again report the same and ask what should be done, to which the answer should be y in order to create a single i86pc partition (here, not a slice) of type SOLARIS2 covering the whole disk capacity:

No fdisk table exists.
The default partition for the disk is:

  a 100% "SOLARIS System" partition

Type "y" to accept the default partition, 
otherwise type "n" to edit the partition table.

y

 
Although not strictly required at this point, after any of the above cases it may be good practice to commit the changes so far (the fdisk partition creation also creates a default slice map within the fdisk partition), by doing as follows:
 
format> label
Ready to label disk, continue? yes


At this point, both for the SPARC and i86pc platforms, it's necessary to define partition (slice) 0 (traditionally called the root partition back in the old age of UFS file-systems) as covering the maximum available disk capacity. Fortunately, it's possible to cover both platform cases with the same sequence of format subcommands, as follows (the example below is from an i86pc platform):
   
format> partition 
...


format> modify
Select partitioning base:
        0. Current partition table (original)
        1. All Free Hog
Choose base (enter number) [0]?
1

Part     Tag Flag Cylinders      Size         Blocks
0       root wm   0             0      (0/0/0)          0
1       swap wu   0             0      (0/0/0)          0
2     backup wu   0 - 1020   1021.00MB (1021/0/0) 2091008
3 unassigned wm   0             0      (0/0/0)          0
4 unassigned wm   0             0      (0/0/0)          0
5 unassigned wm   0             0      (0/0/0)          0
6        usr wm   0             0      (0/0/0)          0
7 unassigned wm   0             0      (0/0/0)          0
8       boot wu   0 -    0      1.00MB (1/0/0)       2048
9 alternates wm   0             0      (0/0/0)          0

Do you wish to continue creating a new partition
table based on above table[yes]?
Free Hog partition[6]?
0
Enter size of partition '1' [0b, 0c, 0.00mb, 0.00gb]:
Enter size of partition '3' [0b, 0c, 0.00mb, 0.00gb]:
Enter size of partition '4' [0b, 0c, 0.00mb, 0.00gb]:
Enter size of partition '5' [0b, 0c, 0.00mb, 0.00gb]:
Enter size of partition '6' [0b, 0c, 0.00mb, 0.00gb]:
Enter size of partition '7' [0b, 0c, 0.00mb, 0.00gb]:

Part     Tag Flag Cylinders     Size            Blocks

0       root wm   1 - 1020  1020.00MB (1020/0/0) 2088960
1       swap wu   0            0      (0/0/0)          0
2     backup wu   0 - 1020  1021.00MB (1021/0/0) 2091008
3 unassigned wm   0            0      (0/0/0)          0
4 unassigned wm   0            0      (0/0/0)          0
5 unassigned wm   0            0      (0/0/0)          0
6        usr wm   0            0      (0/0/0)          0
7 unassigned wm   0            0      (0/0/0)          0
8       boot wu   0 -    0     1.00MB (1/0/0)       2048
9 alternates wm   0            0      (0/0/0)          0

Okay to make this the current partition table[yes]?
Enter table name (remember quotes):
"c8t1d0"

Ready to label disk, continue? y

partition> print
Current partition table (c8t1d0):
Total disk cylinders available: 1020 + 2 (reserved cylinders)

Part     Tag Flag Cylinders     Size         Blocks
0 unassigned wm   1 - 1019  1019.00MB (1019/0/0) 2086912
1 unassigned wm   0            0      (0/0/0)          0
2     backup wu   0 - 1019  1020.00MB (1020/0/0) 2088960
3 unassigned wm   0            0      (0/0/0)          0
4 unassigned wm   0            0      (0/0/0)          0
5 unassigned wm   0            0      (0/0/0)          0
6 unassigned wm   0            0      (0/0/0)          0
7 unassigned wm   0            0      (0/0/0)          0
8       boot wu   0 -    0     1.00MB (1/0/0)       2048
9 unassigned wm   0            0      (0/0/0)          0


partition>
label
Ready to label disk, continue? y
 
partition> quit
format> quit


NOTE
Just for curiosity, the Flags column above can have 4 possible values:
  • wm : writable-mountable
  • wu : writable-unmountable
  • rm : readable-mountable
  • ru : readable-unmountable
One can also see that under an i86pc, s8 is assigned 1 cylinder, cylinder 0, for holding some boot information, whose size, in this particular disk geometry, takes 2048 blocks totaling 1 MB. Therefore, each block is 512 bytes long, which suggests that a block is the same thing as a sector because, in general, sectors are invariably that long on current disks.
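
This can be cross-checked with prtvtoc, whose header comments report the disk geometry, roughly along these lines (abridged; disk name from the example above):

# prtvtoc /dev/rdsk/c8t1d0s2 | head -4
* /dev/rdsk/c8t1d0s2 partition map
*
* Dimensions:
*     512 bytes/sector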
Invoke the format command once more to double-check the results. I have faced a situation in which, upon doing so, the following message appeared:

Note: detected additional allowable expansion storage space
that can be added to current SMI label's computed capacity.
Select to adjust the label capacity.


In following the above instructions another message appeared:
(which I noted that s0 hadn't been updated accordingly)

Expansion of label cannot be undone; continue (y/n) ? y
The expanded capacity was added to the disk label and "s2".
Disk label was written to disk. 


So, to close the above loop, I repeated all the previously shown steps from the modify sub-command onward, and all has seemed fine since then. This time I "gained" just 1 MB, but who knows?

NOTE
It's noticeable that partition 2 (slice s2) overlaps with partition 0 (slice s0). This is not an issue, but it forces the use of the -f flag with the zpool attach command. It should be said that this slice must be kept in order to make possible an eventual (re)installation of the boot-loader (at least under the i86pc platform).
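
For instance, attaching a disk prepared this way to an existing root pool would be along the lines of the following (disk names from this post's example; the attach itself is covered in the ZFS basic mirroring post):

# zpool attach -f rpool c8t0d0s0 c8t1d0s0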
Now that things are presumably clearer, the procedure can be recognized as simple after all, which, in fact, it is. But one issue remains: it's not automated, thus inefficient or impractical if many systems are to be prepared in a row, such as on a not-so-small virtualization or cloud infrastructure. Fortunately the procedure can be streamlined, as long as all the disks have equal geometry (which makes sense for deployments in large chunks). What needs to be done is:
  1. Do the manual process once and then save the result as a template:
    # prtvtoc /dev/rdsk/c8t1d0s2 >/tmp/vtoc-template
     
  2. Repeatedly apply the template, preferably via some ordinary scripting strategy, as sketched after this list:
    # fmthard -s /tmp/vtoc-template /dev/rdsk/c?[t?]d?s2 
       
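A minimal sketch of such a scripted application could be (hypothetical disk names, all assumed to share the template's geometry):

#!/bin/sh
# apply a saved VTOC template to a batch of identically-sized disks
for disk in c8t2d0 c8t3d0 c8t4d0
do
  fmthard -s /tmp/vtoc-template /dev/rdsk/${disk}s2
done
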
 And that's it for SMI (VTOC) labeled disks preparation for ZFS.
  

Thursday, October 5, 2017

ZFS basic mirroring

Mirroring is a traditional strategy for providing fault-tolerance which became popular for secondary storage systems, typically hard-disks. ZFS improves the strategy by introducing checksums to prevent eventually corrupted data (due to bit-rot or some other component malfunction) on one side of the mirror from being replicated to the other, healthy side of the mirror. This ZFS enhancement has been unique, since in general it doesn't seem viable to implement it solely at the physical layer (controller), as it may have dependencies at the logical layer (file-systems). ZFS achieves its goal by abstracting the physical layer into storage pools over which logical datasets (file-systems and raw volumes) are managed.

Establishing mirrors within storage pools is a relatively simple task, especially in more recent versions of Solaris such as Solaris 11.x. But in late Solaris 10 U1x, as well as in Solaris 11 Express, some initial disk preparation was required. In addition, for root pools under these older systems, it was necessary to manually install (via installboot(1M) or installgrub(1M)) the boot-loader on new disks just integrated into a mirror. On more recent versions of Solaris the boot-loader management for mirrored root pools was automated, yet it is still manually manageable via the install-bootloader sub-command of bootadm(1M).

Another usual difference contrasting older systems (Solaris 10 U1x and Solaris 11 Express) with newer Solaris 11.x is how the underlying disks comprising storage pools are seen with respect to disk labeling: SMI (VTOC) for older systems and disks, and EFI (GPT) for newer ones. The most important implications of these two types of label are that SMI labels impose a limit of 2 TB of usable storage even on larger disks and are the only supported label for root pools under older systems. Typically, referring to whole disks (which are preferred for ZFS over legacy slices/partitions), when an SMI label is used, disk names take the form c?[t?]d?s0, otherwise they lack the trailing s0.

Here are some examples of how to successfully and hassle-free establish a basic mirror on storage pools initially consisting of a single disk:


1) When SMI-labeled disks are required for a pool:

This is typical for older systems in general, for some SPARC systems or for not-so-old systems that don't yet support EFI devices on the root pool.

I assume that the disks were already appropriately prepared.

# zpool status
  pool: rpool
 state: ONLINE
 scan: ...
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          c8t0d0s0    ONLINE       0     0     0

errors: No known data errors


# zpool attach -f rpool c8t0d0s0 c8t1d0s0
Make sure to wait until resilver is done before rebooting.


# zpool status
  pool: rpool
 state: ONLINE
status: One or more devices is currently being resilvered.
        The pool will continue to function,
        possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since ...
    1.50G scanned out of 3.78G at ...M/s, 0h2m to go
    1.50G resilvered, 39.70% done
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c8t0d0s0  ONLINE       0     0     0
            c8t1d0s0  ONLINE       0     0     0  (resilvering)

errors: No known data errors


As this is a root pool, when the resilver is complete one can optionally make sure the boot-loader is properly installed on the newly attached disk as well. But according to the official documentation, this extra step is only mandatory when a zpool replace command is issued on the root pool. For an i86pc system, if one decides so, the command would be similar to:

# installgrub \
  /boot/grub/stage1 /boot/grub/stage2 \
  /dev/rdsk/c8t1d0s0

or the newer and far superior:

# bootadm install-bootloader


2) Systems supporting EFI-labeled disks for any kind of pool:

This is good news, as no tedious disk preparation is required beforehand at all; moreover, it would be pointless, as during an attachment the disk will be automatically formatted and labeled as necessary.

Therefore, the attachment procedure is as simple as:

# zpool attach rpool c1t0d0 c1t1d0

NOTE
It's possible to have N disks in a mirror, which means the mirror will withstand as many as N-1 members failing at a given time. This may seem highly exaggerated at first, but it may make sense in some scenarios.

But let me exclude the case of a 3-way mirror for a root pool with over-2-TB disks as an insane one: a root pool should really never require that much space, nor justify a 3rd member to prevent a double-fault while resilvering from a single-fault.

For instance, an N-way mirror (N>3) for a non-root critical pool may make sense and be a straightforward solution if one intends to keep critical data replicated at N-2 remote locations. The mirrored devices (not disks) forming this pool could be iSCSI LUNs from separate remote storage facilities (preferably also backed by ZFS), as long as each LUN isn't comprised of many individual disks and as long as the pool also keeps local log and cache devices, indispensable for better equalizing disparate remote storage performance and link latencies.
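
For what it's worth, growing an existing 2-way mirror into a wider one is just another attach against a current mirror member, something like (hypothetical pool and disk names):

# zpool attach tank c1t1d0 c1t2d0
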
NOTE
Mirrors can be created or added right from the start with a single command, such as:

# zpool create hq \
  mirror c0t0d0 c1t0d0 \
  mirror c0t1d0 c2t0d0

(each mirror above will resist a single disk and controller failure)
(it's similar to RAID-10, but RAID-Z(1) could rival it if the I/O block is over 128KB)

# zpool add hq \
  mirror c1t1d0 c2t1d0

(the hq pool above is now striping over 3 2-way mirrors)
(a better solution could be a RAID-Z2 scheme depending on block size)
 
Each mirror on the example above is known as a vdev.
Not surprisingly, ZFS stripes I/O along the top-level vdevs.
By the way, root pools support just 1 mirror vdev.

To remove a device from a mirror:
# zpool detach rpool c1t1d0

To replace a device in a mirror:
# zpool replace rpool c1t1d0 c1t2d0

And that seems to be pretty much the basics.
  

Tuesday, October 3, 2017

IPMP basics

IPMP is an acronym for IP multi-path, which roughly means resilience in terms of IP connectivity by means of some sort of redundancy provided by multiple paths of communication. This resilience is also commonly referred to as fault-tolerance. In fact, multi-path is a general strategy concept used for resilience in critical subsystems. Another example could be MPIO, which stands for multi-path I/O, but that's another story.

A natural consequence of multiple paths is that performance can be enhanced as well, since streaming data can flow through multiple paths in parallel. But, due to the connection-oriented nature of TCP/IP, this performance enhancement frequently narrows down to outbound traffic, that is, traffic flowing out of the IPMP system to remote clients.

IPMP has been available since older Solaris releases and I would say it has become progressively better and simpler to configure since its inception. According to another post of mine called Legacy & Future, I'll be focusing on Solaris 11 as my discussion's cut-off point. Things started to get significantly simpler and better with Solaris 11 Express and really top-notch onward with Solaris 11.x.

I could talk only about Solaris 11.x, but I'll also address Solaris 11 Express because it's still a nice back-end system capable of running on the x86 (32-bit) platform. As everybody knows, beyond mid-range and high-end big-iron SPARC systems, Solaris 11.x only runs on x86-64 (64-bit) platforms. Oracle has completely dropped support for Solaris 11 Express, as it was marketed as a short-term transition from Solaris 10 to Solaris 11. The last update was SRU-13, or SRU-14 (focused on some Engineered Systems). But the truth is that Solaris 11 Express is an awesome system for business models with near-zero or very small IT budgets based on legacy x86 hardware, and it still rivals much more recent Linux and BSD alternatives because it embeds very advanced key technologies such as ZFS and BEs (boot environments), beyond, of course, other high-end technologies such as IPMP. So if you still have this piece of software, consider using it, especially because it's quite possible to independently update some of its crucial components and applications based on open software.

Back to IPMP, the central idea is to group a given number of network interfaces and associate the group with a pool of new (data) addresses by which it will be publicly accessible. The group is materialized as a new network interface in the system, whose operation and availability are provided by the collaboration of the underlying group members. In general, the number of member network interfaces should be greater than the number of data addresses, and some member network interfaces can each be set as a hot stand-by for the group. When stand-by network interfaces are present the IPMP group is said to be of an active-standby type, otherwise it's of an active-active type. Unless you really have lots of network interfaces to spare, an active-standby IPMP group would waste a precious network resource, so otherwise prefer an active-active IPMP group.

NOTE
Sometimes there's some confusion, argumentation and comparison with another technology known as Link-Aggregation, but they are quite different beasts, although both contribute to resilience and performance. One advantage of IPMP is that it operates on layer 3, thus possessing none of the special layer-2 driver and hardware requirements that Link-Aggregation has. The two are not mutually exclusive and can even be combined, but perhaps each one is better suited to a specific scenario or requirement. For instance, a back-to-back connection between two servers is better implemented via Link-Aggregation, while outbound traffic load spreading may be better deployed via IPMP.
Let's go straight to a minimal practical example, first on Solaris 11 Express and then on Solaris 11.3. Don't be fooled by the simplicity, because the solution is still quite powerful and significant for many application infrastructure models, something not easily attained, if at all, by more modern competing systems. By the way, I will assume that some techniques and technologies (NCP, routes and name resolution) described for manual wired connections will be implicitly used as needed.

EXAMPLE:

Setting up an active-active IPMP group from interfaces net2 and net3, whose link names have been respectively renamed from the e1000g2 and e1000g3 originally available on the system.

# dladm show-phys
LINK      MEDIA         STATE      SPEED  DUPLEX    DEVICE
...
net2      Ethernet      unknown    0      half      e1000g2
net3      Ethernet      unknown    0      half      e1000g3

...
 
The newly generated network interface representing the new IPMP group will stop working only if both net2 and net3 fail simultaneously, but as long as both underlying interfaces remain operational, up to 2 Gbps of overall outbound bandwidth will be available for multiple TCP connections, bearing in mind that there's still no more than 1 Gbps of inbound bandwidth per single TCP connection.

NOTE
It may still not be crystal clear, but having N underlying interfaces of 1 Gbps on a given IPMP group will generally provide an overall outbound bandwidth of N Gbps for that IPMP group. The inbound bandwidth is a different story; if M < N data addresses are configured for an IPMP group of 1 Gbps underlying interfaces, then the overall inbound performance will still be limited to 1 Gbps per TCP session inbound traffic even though it may be possible to simultaneously have M such sessions.

On Solaris 11 Express:

Under Solaris 11 Express the update of the IPMP management interface is still transitioning and its crucial parts still must be managed via the old ifconfig command. Do not attempt to manage the underlying interfaces net2 and net3 via the new ipadm command for anything related to IPMP.

Configure the IPMP group and its data-address.
The group will subsequently receive the underlying member interfaces:
# ifconfig ipmp0 ipmp 192.168.1.230/24 up
Configure the underlying interfaces:
# ifconfig net2 plumb group ipmp0 up
# ifconfig net3 plumb group ipmp0 up
 
NOTE
In the case of an active-standby configuration it would be necessary to choose one of the underlying interfaces as a stand-by interface by simply inserting the standby keyword just before the up keyword.
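That is, the active-standby variant of the commands above would look roughly like this (net3 arbitrarily chosen as the standby member):

# ifconfig net3 plumb group ipmp0 standby up
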
Verify the configuration:
# ifconfig -a |ggrep -A2 'ipmp0:'
ipmp0: flags=8001000842<UP,BROADCAST,RUNNING,MULTICAST,IPv4,IPMP> ...
        inet 192.168.1.230 netmask ffffff00 broadcast 192.168.1.255
        groupname ipmp0


# ifconfig -a |ggrep -A3 -E 'net(2|3):'
net2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> ...
      inet 0.0.0.0 netmask ff000000
      groupname ipmp0
      ether 8:0:27:fe:f6:44
net3: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> ...
      inet 0.0.0.0 netmask ff000000
      groupname ipmp0
      ether 8:0:27:c3:94:2


# ipadm show-if |ggrep -E 'ipmp0|net2|net3'
ipmp0  ok   bm--I-----4- ---
net2   ok   bm--------4- ---
net3   ok   bm--------4- ---


# ipadm show-addr 'ipmp0/'
ADDROBJ    TYPE     STATE  ADDR
ipmp0/?    static   ok     192.168.1.230/24


# ipmpstat -g
GROUP   GROUPNAME  STATE  FDT  INTERFACES
ipmp0   ipmp0      ok     --   net3 net2


# ipmpstat -i
INTERFACE ACTIVE GROUP  FLAGS   LINK PROBE    STATE
net3      yes    ipmp0  ------- up   disabled ok
net2      yes    ipmp0  --mb--- up   disabled ok
Make the configuration persistent across reboots:
(the order of parameters below is important in obtaining the exact results)
# cat /etc/hostname.ipmp0
ipmp 192.168.1.230/24 up

# cat /etc/hostname.net2
group ipmp0 up

# cat /etc/hostname.net3
group ipmp0 up
To eventually disable and clean up the IPMP group:
# rm /etc/hostname.net3
# rm /etc/hostname.net2
# rm /etc/hostname.ipmp0
 
# ifconfig ipmp0 down
# ifconfig net2 down
# ifconfig net3 down
 
# ifconfig net2 group ""
# ifconfig net3 group ""

# ifconfig net2 unplumb
# ifconfig net3 unplumb
# ifconfig ipmp0 unplumb

On Solaris 11.3:

Under Solaris 11.3 things are somewhat easier. The IPMP management has been fully integrated into the ipadm command and persistence across reboots is on by default, requiring no additional actions.

Configure the underlying interfaces:
# ipadm create-ip net2
# ipadm create-ip net3
Configure the IPMP group:
# ipadm create-ipmp -i net2,net3 ipmp0
Set the data address for the IPMP group:
# ipadm create-addr -T static -a 192.168.1.230/24 ipmp0
ipmp0/v4


NOTE

Unfortunately, perhaps due to some subtle bug in the GA release of Solaris 11.3, it seems safer to only set the IPMP group data-address after the underlying interfaces have been added to the IPMP group.
To eventually disable and clean up the IPMP group:
(the order is important, again, due to some subtle bug)
# ipadm delete-addr ipmp0/v4
# ipadm remove-ipmp -i net2,net3 ipmp0
# ipadm delete-ipmp ipmp0
# ipadm delete-ip net2
# ipadm delete-ip net3
In the rare case where a standby underlying interface is still desired, for instance net4, it suffices to perform the following commands:
# ipadm create-ip net4
# ipadm set-ifprop -p standby=on -m ip net4
# ipadm add-ipmp -i net4 ipmp0
 
# ipadm show-if
IFNAME   CLASS    STATE   ACTIVE OVER
lo0      loopback ok      yes    --
ipmp0    ipmp     ok      yes    net2 net3 net4
net2     ip       ok      yes    --
net3     ip       ok      yes    --
net4     ip       ok      no     --


# ipmpstat -g

GROUP    GROUPNAME  STATE  FDT  INTERFACES
ipmp0    ipmp0      ok     --   net3 net2 (net4)

That's all very powerful and not that difficult to set up.
For sure one more cool technology available in Solaris!