Friday, July 12, 2013

MPSS & large chunks of memory

Solaris, as a high-end Unix, can handle lots of memory.
I just say large for what other OSes might consider huge.
Nevertheless, a good application ought to cooperate with the system.
One way of doing that is obtaining specialized chunks of memory.
This strategy can be an important overall optimization.

Solaris is not only a high-capacity system but also a very flexible one.
In fact, it has to be; otherwise it couldn't sustain almost linear scalability.
Examples of that, related to main memory, are (D)ISM ((dynamic) intimate shared memory) and MPSS (multiple page size support).

It seems very obvious to allocate large chunks of memory for specialized purposes. The problem is implementing it properly. Part of the challenge is knowing what the underlying system provides; in this case, hopefully, Solaris. For instance, it's not difficult in Solaris to obtain a large chunk of locked memory that's a multiple of the largest supported virtual memory page size and properly aligned for the strictest C++ data type of the platform.
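The size arithmetic behind that guarantee is simple; a minimal portable sketch (the helper name round_up is mine, not a Solaris API):

```cpp
#include <cstddef>

// Round a requested size up to the next multiple of pagesize.
// Assumes pagesize is a power of two, which page sizes always are.
inline std::size_t round_up( std::size_t const size,
                             std::size_t const pagesize )
{
    return ( size + pagesize - 1 ) & ~( pagesize - 1 );
}
```

For example, round_up( 1, 2 * 1024 * 1024 ) yields one whole 2M page.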

ALTERNATIVE 1

Allocating an ISM segment very conveniently and automatically implements this strategy, as was partially revealed in ISM run sample 3.0. Note that in the ISM-specific case the respective sample code didn't need any explicit locking control; that is, it required neither the proc_lock_memory privilege nor the mlock() / munlock() system calls.
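For reference, the core of such a sample boils down to System V shared memory attached with the Solaris-specific SHM_SHARE_MMU flag; a hedged sketch of my own (guarded so it degrades to ordinary shared memory where the flag doesn't exist):

```cpp
#include <sys/ipc.h>
#include <sys/shm.h>
#include <cstddef>

// Create and attach a shared memory segment.
// On Solaris, SHM_SHARE_MMU requests ISM: the kernel locks the
// segment and backs it with large pages, no mlock() required.
inline void * attach_segment( std::size_t const size, int & id )
{
    int flags = 0;
#ifdef SHM_SHARE_MMU
    flags = SHM_SHARE_MMU;
#endif
    id = ::shmget( IPC_PRIVATE, size, IPC_CREAT | 0600 );
    if ( id == -1 )
        return 0;

    void * const p = ::shmat( id, 0, flags );
    return p == reinterpret_cast< void * >( -1 ) ? 0 : p;
}

inline void detach_segment( void * const p, int const id )
{
    ::shmdt( p );
    ::shmctl( id, IPC_RMID, 0 ); // mark the segment for removal
}
```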

ALTERNATIVE 2

What's also very cool in Solaris is that you can also take advantage of this strategy for the program's heap, by manually and dynamically:
  1. Reconfiguring the virtual memory page size that backs the program's heap to the platform's largest supported value;
     
  2. Allocating a large chunk of memory that is aligned to and is a multiple of the platform's largest supported virtual memory page size;
     
  3. Optionally, locking all the large virtual memory pages comprising the chunk so that no paging ever happens. This consumes more main memory but may improve performance and avoid the corresponding swap space reservation.
      
Example:

// Requires <unistd.h>, <sys/mman.h> and <alloca.h>
std::size_t largest_pagesize()
{
    std::size_t largest = ::sysconf( _SC_PAGESIZE );

    int n = ::getpagesizes( NULL, 0 );
    if ( n <= 0 )
        return largest;

    // alloca() must be called here, not in a helper function:
    // its memory vanishes when the function that called it returns
    std::size_t * const size = static_cast< std::size_t * >
        ( ::alloca( n * sizeof( std::size_t ) ) );

    if ( ::getpagesizes( size, n ) != -1 )
        while ( --n >= 0 )
            if ( size[ n ] > largest )
                largest = size[ n ];

    return largest;
}

inline bool hat_advise_bssbrk( std::size_t const pagesize )
{
    ::memcntl_mha mha;

    mha.mha_cmd = MHA_MAPSIZE_BSSBRK;
    mha.mha_flags = 0;
    mha.mha_pagesize = pagesize;

    return 
        ::memcntl
        ( 
            0, 
            0,
            MC_HAT_ADVISE,
            reinterpret_cast< caddr_t >( & mha ),
            0,
            0 
        )
        == 0;
}

// A friendly usage pattern.
// Doesn't handle any exceptions.
void f() 
{
    // Chunk size should be a large-page size multiple
    std::size_t const pagesize = largest_pagesize(); 
    std::size_t const size = 512 * pagesize;

    // Advise HAT to adopt a large-page size
    // May trigger fix-up viewable with pmap -xs 
    // So from now on the heap will use large-page sizes
    if ( !hat_advise_bssbrk( pagesize ) )
        ::perror( NULL );

    // Friendly reserve a large-page multiple (chunk) from heap
    // May trigger additional fix-up viewable with pmap -xs 
    // Reserved address range will also be viewable 
    void * p = ::memalign( pagesize, size );
    if ( p == NULL )
        return;

    // Optional, but useful to prevent swapping
    if ( ::mlock( p, size ) != 0 )
        ::perror( NULL );

    // Touch the reserved memory 
    // Triggers the actual allocation
    ::memset( p, '*', size );

    if ( ::munlock( p, size ) != 0 )
        ::perror( NULL );

    // Done
    ::free( p );
}
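A quick way to verify the invariant the code above relies on: the pointer returned by memalign(), or by its standardized sibling posix_memalign(), must be aligned to the requested page size, otherwise the MC_HAT_ADVISE and mlock() calls would straddle page boundaries. A portable sketch:

```cpp
#include <cstddef>
#include <cstdlib>

// Allocate a chunk aligned to pagesize via the POSIX sibling of memalign()
inline void * aligned_chunk( std::size_t const pagesize,
                             std::size_t const size )
{
    void * p = 0;
    return ::posix_memalign( & p, pagesize, size ) == 0 ? p : 0;
}

// The alignment invariant that large-page advice depends on
inline bool page_aligned( void const * const p,
                          std::size_t const pagesize )
{
    return reinterpret_cast< std::size_t >( p ) % pagesize == 0;
}
```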

ALTERNATIVE 3

The previous alternative may be too invasive or too broad, as the whole program's heap is reconfigured for a large-page size. With that approach I can't leave just the allocated (special) chunk backed by large pages, which is probably not what's desired for other ordinary, casual allocations. Furthermore, trying to revert the heap pages back to the defaults negatively affects previous (special) large-page allocations.

Thanks again to Solaris, the solution is easy.
Consider the following slight variations from the previous code:

inline bool hat_advise_va
(
    void const * const p,
    std::size_t const size,
    std::size_t const pagesize 
)
{
    ::memcntl_mha mha;

    mha.mha_cmd = MHA_MAPSIZE_VA;
    mha.mha_flags = 0;
    mha.mha_pagesize = pagesize;

    return 
        ::memcntl
        ( 
            static_cast< caddr_t >
                ( const_cast< void * >( p ) ),
            size,
            MC_HAT_ADVISE,
            reinterpret_cast< caddr_t >( & mha ),
            0,
            0 
        )
        == 0;
}

// A typical usage pattern.
// Doesn't handle any exceptions.
void f() 
{
    // Chunk size should be a large-page size multiple
    std::size_t const pagesize = largest_pagesize(); 
    std::size_t const size = 512 * pagesize;

    // Friendly reserve a large-page multiple (chunk) from heap 
    // May trigger additional fix-up viewable with pmap -xs 
    // Reserved address range will also be viewable 
    void * p = ::memalign( pagesize, size );
    if ( p == NULL )
        return;

    // Advise HAT to adopt a large-page size for the chunk
    // May trigger fix-up viewable with pmap -xs 
    if ( !hat_advise_va( p, size, pagesize ) )
        ::perror( NULL );

    // Optional, but useful to prevent swapping
    if ( ::mlock( p, size ) != 0 )
        ::perror( NULL );

    // Touch the reserved memory
    // Triggers the actual allocation
    ::memset( p, '*', size );

    // May immediately page-out parts of chunk 
    if ( ::munlock( p, size ) != 0 )
        ::perror( NULL );

    // Done
    ::free( p );
}
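Since the allocate / advise / lock / use / unlock / free sequence must not be cut short by an early return, it's a natural fit for RAII. A portable sketch (the LockedChunk name is mine; locking is treated as best-effort because mlock() may fail without privilege):

```cpp
#include <cstddef>
#include <cstdlib>
#include <sys/mman.h>

// Owns an aligned heap chunk and, when possible, keeps it locked
class LockedChunk
{
public:
    LockedChunk( std::size_t const pagesize, std::size_t const size )
        : p_( 0 ), size_( size ), locked_( false )
    {
        if ( ::posix_memalign( & p_, pagesize, size ) != 0 )
            p_ = 0;

        // Best-effort: may fail without the proper privilege
        locked_ = p_ != 0 && ::mlock( p_, size_ ) == 0;
    }

    ~LockedChunk()
    {
        if ( locked_ )
            ::munlock( p_, size_ );

        ::free( p_ );
    }

    void * get() const { return p_; }
    bool locked() const { return locked_; }

private:
    LockedChunk( LockedChunk const & );             // non-copyable
    LockedChunk & operator=( LockedChunk const & );

    void * p_;
    std::size_t size_;
    bool locked_;
};
```

On Solaris, the constructor would also be the natural place for the hat_advise_va() call shown above.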

By the way, if I want to check the page size backing a certain region:

inline std::size_t pagesize( void const * const p )
{
    uint_t const request = MEMINFO_VPAGESIZE;

    uint64_t output;
    uint_t validity;

    // This cast assumes a 64-bit build, where sizeof( p ) == sizeof( uint64_t )
    if ( ::meminfo
         (
            reinterpret_cast< uint64_t const * >( & p ),
            1,
            & request,
            1,
            & output,
            & validity
         )
         == 0
       )

        // Is p a valid virtual address?
        if ( validity & 1 )

            // Has the virtual address been touched?
            // Is there any memory page backing it?
            if ( validity & 2 )
                return output;

    // No page is backing the virtual address
    return 0;
}

To illustrate the behavior of the code, compiled (-g -m64) for an Intel x64 system, let's take excerpts of the heap life cycle from a debugging session, sampled with a series of:

$ pmap -xs `pgrep <program>` | head -25

The program's heap starts as follows:

         Address  Kbytes     RSS    Anon  Locked Pgsz Mode  Mapped File
0000000000400000      16      16       -       -   4K r-x-- ...
0000000000413000       4       4       4       -   4K rw--- ...
0000000000414000      36      36      36       -   4K rw---   [ heap ]
000000000041D000       4       -       -       -    - rw---   [ heap ]
000000000041E000       8       8       8       -   4K rw---   [ heap ]
0000000000420000      28       -       -       -    - rw---   [ heap ]
0000000000427000       4       4       4       -   4K rw---   [ heap ]
0000000000428000      28       -       -       -    - rw---   [ heap ]
000000000042F000       4       4       4       -   4K rw---   [ heap ]
0000000000430000      28       -       -       -    - rw---   [ heap ]
0000000000437000       4       4       4       -   4K rw---   [ heap ]
FFFF80FFB8EB0000       4       4       -       -   4K r-x-- ...
...

After the call to ::memalign():

         Address  Kbytes     RSS    Anon  Locked Pgsz Mode  Mapped File
...
0000000000437000       4       4       4       -   4K rw---    [ heap ]
0000000000438000    1820       -       -       -    - rw---    [ heap ]
00000000005FF000       4       4       4       -   4K rw---    [ heap ]
0000000000600000 1048576       -       -       -    - rw---    [ heap ]
0000000040600000       4       4       4       -   4K rw---    [ heap ]
0000000040601000     216       -       -       -    - rw---    [ heap ]
0000000040637000       4       4       4       -   4K rw---    [ heap ]
FFFF80FFB8EB0000       4       4       -       -   4K r-x-- ...
...

After the call to ::mlock():

         Address  Kbytes     RSS    Anon  Locked Pgsz Mode  Mapped File
...
00000000005FF000       4       4       4       -   4K rw---    [ heap ]
0000000000600000 1048576 1048576       - 1048576    - rw---    [ heap ]
0000000040600000       4       4       4       -   4K rw---    [ heap ]
...


After the call to ::memset():

         Address  Kbytes     RSS    Anon  Locked Pgsz Mode  Mapped File
...
00000000005FF000       4       4       4       -   4K rw---    [ heap ]
0000000000600000 1048576 1048576 1048576 1048576   2M rw---    [ heap ]
0000000040600000       4       4       4       -   4K rw---    [ heap ]
...


After the call to ::munlock():

         Address  Kbytes     RSS    Anon  Locked Pgsz Mode  Mapped File
...
00000000005FF000       4       4       4       -   4K rw---    [ heap ]
0000000000600000  563200  563200  563200       -   2M rw---    [ heap ]
0000000022C00000    8192    8192       -       -    - rw---    [ heap ]
0000000023400000  477184  477184  477184       -   2M rw---    [ heap ]
0000000040600000       4       4       4       -   4K rw---    [ heap ]
...


Among other things, note that the next available heap address lies on a default-size page (on this system, 4K) instead of a largest-size page (on this system, 2M). This is exactly the fine-grained, manual, dynamic control I was looking for. Thanks to Solaris, of course!
 

Thursday, July 4, 2013

Nested C++ template specialization

C++ template specialization can be a rather advanced topic.
It's much used in meta-programming to provide optimizations.

It's better to avoid nesting what's already difficult.
Coping with complexity isn't necessarily the best solution.
Sorry if I disappoint you.

As an example of what I mean, find below a quick sample that helps optimize natural-number exponentiation. It's slightly more elaborate than the one provided in C++ Templates: The Complete Guide by Nicolai Josuttis and David Vandevoorde.

The first part is the "nested" meta-program:

template< unsigned long long B, unsigned char P >
struct exponentiation
{
    enum
    {
        result = B * exponentiation< B, P - 1 >::result
    };
};

template< unsigned long long B >
struct exponentiation< B, 0 >
{
    enum
    {
        result = 1ULL
    };
};

The second part is (perhaps) a convenience wrapper:

template< unsigned long long B >
struct base
{
    template< unsigned char P >
    struct power
    {
        enum { result = exponentiation< B, P >::result };
    };
};

The usage should be obvious:

char buffer[ base<2>::power<56>::result ];

The maximum compile-time capacity of Oracle Solaris Studio 12.3 for Solaris 11.1 running on Intel x64 is computing 2 to the power of 61:

2305843009213693952