Learnings on Solaris™: Memory


Monday, April 10, 2017

The physical memory

Physical memory is a crucial and precious resource. It is commonly one of the major system bottlenecks and also one of the components that most drives up a system's price.

Knowing and, better yet, determining at runtime how much physical memory is installed on a host and how much is actually available to the system matters to many deployment and administration strategies.

Without resorting to programming at the system API level, these figures are easy to determine, as shown below.

$ prtconf | grep Mem
Memory size: 8192 Megabytes

# echo ::memstat | mdb -k | grep Total
Total            2096958            7.9G

$ kstat -p -n system_pages | egrep 'avail|physmem|locked|total'
unix:0:system_pages:availrmem    930155
unix:0:system_pages:pageslocked 1162706
unix:0:system_pages:pagestotal 2092861
unix:0:system_pages:physmem     2092861

Note that pagestotal = availrmem + pageslocked (930155 + 1162706 = 2092861) and that, interestingly, pagestotal = physmem. All of these figures are counted in pages.

$ pagesize
4096

Now we can compare things and better grasp the reality:

$ echo "(2092861 * `pagesize`) / 1024 ^ 2" | bc
8175

$ echo "(2096958 * `pagesize`) / 1024 ^ 2" | bc
8191

The difference between 8191 MB and 8175 MB is about 16 MB (4097 pages). It seems to be fixed (non-pageable) and is still a mystery to me, a matter for further investigation; perhaps it is some part of the kernel known only to the internal staff.

That is, by its best figure the system actually sees 8191 MB, 1 MB less than what is physically installed on the host. It's not hard to guess why (perhaps it is set aside for on-board video or the like). Using figures closer to reality ought to provide more exact results for planning and assessments.
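For completeness, the same kind of figure can also be obtained programmatically. The following is only a minimal sketch, assuming that sysconf(3C) supports _SC_PHYS_PAGES (a common extension on Solaris and Linux, not required by POSIX). The helper names pages_to_mib and physical_memory_mib are made up for illustration.

```cpp
#include <unistd.h>

// Convert a page count into whole megabytes (truncating, like bc above).
inline long long pages_to_mib( long long pages, long long page_size )
{
    return pages * page_size / ( 1024LL * 1024LL );
}

// Physical memory as seen by the system, in MB, or -1 on failure.
// Assumption: _SC_PHYS_PAGES is available (it is an extension).
inline long long physical_memory_mib()
{
    long const pages = ::sysconf( _SC_PHYS_PAGES );
    long const size = ::sysconf( _SC_PAGESIZE );

    if ( pages == -1 || size == -1 )
        return -1;

    return pages_to_mib( pages, size );
}
```

With the figures from this post, pages_to_mib( 2092861, 4096 ) gives 8175 and pages_to_mib( 2096958, 4096 ) gives 8191, matching the bc results.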

Kernel zones & ZFS ARC

Assuming your system meets the requirements for kernel zones support, one important tuning is adjusting the ZFS ARC maximum size in bytes (the well-known zfs_arc_max in /etc/system). I did a somewhat similar tuning a couple of years ago, as a best practice, right after installing VirtualBox. For kernel zones it may not be just a best practice but more likely a case of be advised, or neglect it at your own risk!

By the way, according to more recent Solaris public documentation, the host system sees kernel zones just as another application. The required tuning on the host system should take into account all the kernel zones and processes that are anticipated to run on the system.

In the past, to figure out the current zfs_arc_max I just relied on the c_max value (in bytes) from kstat -n arcstats. More recently, though, the Solaris 11.2 documentation refers to ::memstat from mdb -k. So let's put them in perspective (keeping in mind that other figures from arcstats, not considered below, may also play a role):

# kstat -n arcstats | grep c_max
    c_max                           7498616832

# echo ::memstat | mdb -k
Page Summary                 Pages             Bytes %Tot
----------------- ---------------- ---------------- ----
Kernel                      293573              1.1G   14%
ZFS Metadata                 28199            110.1M    1%
ZFS File Data               517332              1.9G   25%
Anon                        269994              1.0G   13%
Exec and libs                 6008             23.4M    0%
Page cache                  328957              1.2G   16%
Free (cachelist)              3779             14.7M    0%
Free (freelist)             628887              2.3G   30%
Total                      2096958              7.9G

# pagesize
4096

To quote the Solaris 11.2 documentation topic:

The suggested value is one-half of what you would like the host ZFS resources to use. For example, if you want ZFS to use less than 2 GB of memory, set the ARC cache to 1 GB, or 0x40000000.

Furthermore the Solaris 11.2 documentation on zfs_arc_max says:

75% of memory on systems with less than 4 GB of memory.
physmem minus 1 GB on systems with greater than 4 GB of memory.

If a future memory requirement is significantly large and well defined, you might consider reducing the value of this parameter to cap the ARC so that it does not compete with the memory requirement. For example, if you know that a future workload requires 20% of memory, it makes sense to cap the ARC such that it does not consume more than the remaining 80% of memory.
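The default rule quoted above can be checked against the c_max value reported by kstat. This is a small sketch of that arithmetic only; the function name default_arc_max is hypothetical, and treating exactly 4 GB like the "greater than" case is my assumption, since the quoted text leaves that boundary unspecified.

```cpp
// Documented default for zfs_arc_max, given physical memory in bytes.
// Assumption: exactly 4 GB falls into the "greater than 4 GB" case.
inline unsigned long long default_arc_max( unsigned long long physmem_bytes )
{
    unsigned long long const gb = 1024ULL * 1024ULL * 1024ULL;

    if ( physmem_bytes < 4 * gb )
        return physmem_bytes / 4 * 3;   // 75% of memory

    return physmem_bytes - gb;          // physmem minus 1 GB
}
```

For this host, physmem is 2092861 pages of 4096 bytes, that is, 8572358656 bytes. Subtracting 1 GB gives 7498616832, which is exactly the c_max value shown by kstat above.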

But in Solaris 11.3 things start to change a bit. There's a new tunable called user_reserve_hint_pct (from 0 to 99, defaulting to 0, also set in /etc/system as set user_reserve_hint_pct=...) intended to supersede zfs_arc_max. About the new tunable, Solaris 11.3 documentation says:

Informs the system about how much memory is reserved for application use, and therefore limits how much memory can be used by the ZFS ARC cache as the cache increases over time.

By means of this parameter, administrators can maintain a large reserve of available free memory for future application demands. The user_reserve_hint_pct parameter is intended to be used in place of the zfs_arc_max parameter to restrict the growth of the ZFS ARC cache.

If a dedicated system is used to run a set of applications with a known memory footprint, set the parameter to the value of that footprint.

For upward adjustments, increase the value if the initial value is determined to be insufficient over time for application requirements, or if application demand increases on the system. Perform this adjustment only within a scheduled system maintenance window. After you have changed the value, reboot the system.

For downward adjustments, decrease the value if allowed by application requirements. Make sure to use decrease the value only by small amounts, no greater than 5% at a time.
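To turn a known application footprint into a candidate value for the tunable, a rough calculation such as the one below might help. It is only a sketch: the function name reserve_hint_pct is made up, and it assumes that the tunable is expressed as a percentage of physical memory, within the 0 to 99 range mentioned above.

```cpp
// Smallest whole percentage of physical memory that covers the given
// application footprint, clamped to the tunable's 0..99 range.
// Assumption: the tunable is a percentage of physical memory.
inline unsigned reserve_hint_pct( unsigned long long footprint_bytes,
                                  unsigned long long physmem_bytes )
{
    if ( physmem_bytes == 0 )
        return 0;

    // Round up so the reserve is never smaller than the footprint.
    unsigned long long const pct =
        ( footprint_bytes * 100 + physmem_bytes - 1 ) / physmem_bytes;

    return pct > 99 ? 99u : static_cast< unsigned >( pct );
}
```

For instance, a 2 GB footprint on an 8 GB host yields 25, which would then go into /etc/system as set user_reserve_hint_pct=25.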

Thursday, July 10, 2014

RAM ZFS pool

It's probably not efficient to have a temporary ZFS pool on top of a RAM disk.
After all the ZFS ARC and L2ARC are very efficient intermediate caches.

Anyway, it's possible to define a temporary ZFS pool on top of a RAM disk.
Perhaps (still have to check) I could disable ARC and L2ARC.
It's also possible that this ZFS pool be encrypted.

At this point, perhaps I could have a case to justify this idea.
But I'm not recommending nor advising anything.
For now, I'm just exploring technologies.

Assume I already have a RAM disk:

# ramdiskadm
Block Device                Size Removable
/dev/ramdisk/c-01      268435456    Yes

On top of this block device I can establish the temporary ZFS pool.
And I want its root dataset to be encrypted right from the start.

# zpool create -R /c-01 -O encryption=on c-01 /dev/ramdisk/c-01
Enter passphrase for 'c-01':
Enter again:

After the passphrase is entered twice, the temporary ZFS pool is created.
By default, encryption prompts for a passphrase.
There's a space overhead for ZFS metadata.

# zpool list c-01
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
c-01 250M   100K 250M   0% 1.00x ONLINE /c-01

# zpool status c-01
pool: c-01
state: ONLINE
scan: none requested
config:

    NAME                     STATE     READ WRITE CKSUM
    c-01                     ONLINE       0     0     0
    /dev/ramdisk/c-01      ONLINE       0     0     0

The -R option automatically sets the altroot and cachefile properties.

# zpool get all c-01
NAME PROPERTY       VALUE                 SOURCE
c-01 allocated      100K                  -
c-01 altroot        /c-01                 local
c-01 autoexpand     off                   default
c-01 autoreplace    off                   default
c-01 bootfs         -                     default
c-01 cachefile      none                  local
c-01 capacity       0%                    -
c-01 dedupditto     0                     default
c-01 dedupratio     1.00x                 -
c-01 delegation     on                    default
c-01 failmode       wait                  default
c-01 free           250M                  -
c-01 guid           14233173042207590481 -
c-01 health         ONLINE                -
c-01 listshares     off                   default
c-01 listsnapshots off                   default
c-01 readonly       off                   -
c-01 size           250M                  -
c-01 version        34                    default

The -O option sets the defaults for encryption.
Also note the default mountpoint for the associated root dataset.

# zfs get all c-01
NAME PROPERTY              VALUE                  SOURCE
c-01 aclinherit            restricted             default
c-01 aclmode               discard                default
c-01 atime                 on                     default
c-01 available             218M                   -
c-01 canmount              on                     default
c-01 casesensitivity       mixed                  -
c-01 checksum              sha256-mac             local
c-01 compression           off                    default
c-01 compressratio         1.00x                  -
c-01 copies                1                      default
c-01 creation              Thu Jul 10 11:20 2014 -
c-01 dedup                 off                    default
c-01 devices               on                     default
c-01 encryption            on                     local
c-01 exec                  on                     default
c-01 keychangedate         Thu Jul 10 11:20 2014 local
c-01 keysource             passphrase,prompt      local
c-01 keystatus             available              -
c-01 logbias               latency                default
c-01 mlslabel              none                   -
c-01 mounted               yes                    -
c-01 mountpoint            /c-01                  local
c-01 multilevel            off                    -
c-01 nbmand                off                    default
c-01 normalization         none                   -
c-01 primarycache          all                    default
c-01 quota                 none                   default
c-01 readonly              off                    default
c-01 recordsize            128K                   default
c-01 referenced            33K                    -
c-01 refquota              none                   default
c-01 refreservation        none                   default
c-01 rekeydate             Thu Jul 10 11:20 2014 local
c-01 reservation           none                   default
c-01 rstchown              on                     default
c-01 secondarycache        all                    default
c-01 setuid                on                     default
c-01 shadow                none                   -
c-01 share.*               ...                    local
c-01 snapdir               hidden                 default
c-01 sync                  standard               default
c-01 type                  filesystem             -
c-01 used                  100K                   -
c-01 usedbychildren        67.5K                  -
c-01 usedbydataset         33K                    -
c-01 usedbyrefreservation 0                      -
c-01 usedbysnapshots       0                      -
c-01 utf8only              off                    -
c-01 version               6                      -
c-01 vscan                 off                    default
c-01 xattr                 on                     default
c-01 zoned                 off                    default

As I said earlier, ZFS provides an intelligent, layered caching architecture composed of the ARC (Adaptive Replacement Cache) and the L2ARC (Level 2 ARC). Each ZFS dataset (file system, volume, etc.) has one property per cache: primarycache for the ARC and secondarycache for the L2ARC. Both properties provide a way to tell the ZFS engine how each cache should be used. Each property accepts one of three values: all, metadata and none. As the backing ZFS pool device is a RAM disk I propose to evaluate the effect of setting both properties to none.

# zfs set primarycache=none c-01

# zfs set secondarycache=none c-01

# zfs get primarycache,secondarycache c-01
NAME PROPERTY        VALUE SOURCE
c-01 primarycache    none   local
c-01 secondarycache none   local

As I presumed from the beginning, the experiment as presented above doesn't seem to offer any advantage when the system is mostly idle or not yet heavily loaded. But it could be interesting to develop several scenarios in order to provide a better overall evaluation.

Wednesday, July 9, 2014

RAM disks

So RAM disks are still around in Solaris 11.
Well, of course, they are great if used judiciously, I suppose.
In other words, this means they shouldn't be abused.

The key points about RAM disks are:

They consume precious physical RAM.
According to ramdisk(7D) they can use up to 25% of RAM.
(I suppose their pages aren't swappable.)
(Perhaps ipcs(1) and pmap(1) could help, as in ISM run sample 3.0.)
They are ephemeral.
(they do not persist across reboots)

The main CLI/shell interface is the ramdiskadm(1M) command.
Naturally, no RAM disks exist in the system by default:

# ramdiskadm

# ls -al /dev/ramdisk
/dev/ramdisk: No such file or directory

# swap -sh
total: 1.9G allocated + 1.8G reserved = 3.7G used, 6.8G available

To create a RAM disk I must specify its name and size.
If you specify a size that cannot be honored, an error message is printed out.
Otherwise, the new device name associated with the disk is printed out.

# ramdiskadm -a cache-01 512m
ramdiskadm: couldn't create ramdisk "cache-01":
Resource temporarily unavailable

# ramdiskadm -a cache-01 256m
/dev/ramdisk/cache-01

# ramdiskadm
Block Device                    Size Removable
/dev/ramdisk/cache-01      268435456    Yes

# swap -sh
total: 1.9G allocated + 1.8G reserved = 3.7G used, 6.5G available

The associated block file name is:

# ls -l /dev/ramdisk
total 1
... cache-01 -> ../../devices/pseudo/ramdisk@1024:cache-01

The associated raw file name is:

# ls -l /dev/rramdisk
total 1
... cache-01 -> ../../devices/pseudo/ramdisk@1024:cache-01,raw

In case the RAM disk must be disposed of before the next reboot:
(perhaps a RAM shortage is taking place and space must be freed)

# ramdiskadm -d cache-01

As an example, a RAM disk could perhaps be used as the raw device for a temporary ZFS pool intended for some very specialized scenario.

Wednesday, July 2, 2014

Limiting /tmp size

There's no doubt that the /tmp file system is very useful, even indispensable.
But, as you know, it's backed by the system's swap space.
Thus it's a good idea to limit /tmp.

This is easily done by editing /etc/vfstab.
But a reboot is required for the changes to take effect.

Here's an example of limiting /tmp to 2 GB.
Just edit the corresponding line as below:

$ grep tmp /etc/vfstab
swap - /tmp tmpfs - yes size=2048m

NOTE

But be aware of one caveat when setting a limit: don't set it too low for normal system operation. That is, before setting a value, take a baseline of the typical space consumption of /tmp and set a reasonable value above it. Otherwise, you'll likely see unexpected error messages in apparently unrelated components or subsystems that fail to allocate space in /tmp.

As an example of failing to identify a baseline before setting a limit, take my own experience: I set a 1 GB limit and didn't notice /tmp filling up until, after locking my desktop, I got an error message from the X screensaver stating "ftruncate() error: no space left on device". I couldn't find anything in /var/adm/messages, SMF services or ZFS pools and file systems that indicated a device filling up. I only realized it was the /tmp limit because I happened to be researching the possibility of encrypting swap which, by the way, is still unsupported in Solaris 11.1. In fact, when such "mysterious" error messages appear, only DTrace can help spot the source or at least provide a good clue.
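Taking a baseline means sampling how full /tmp gets over time. Besides df, this can be done from a program with the POSIX statvfs() call. The snippet below is just a sketch of that idea; the function name usage_percent is hypothetical, and how a given tmpfs implementation reports its size limit through statvfs() is something I'm assuming rather than asserting.

```cpp
#include <sys/statvfs.h>

// Percentage of a file system's capacity currently in use,
// or -1.0 if statvfs() fails or reports no blocks at all.
inline double usage_percent( char const * path )
{
    struct statvfs vfs;

    if ( ::statvfs( path, & vfs ) != 0 || vfs.f_blocks == 0 )
        return -1.0;

    double const total = static_cast< double >( vfs.f_blocks );
    double const unused = static_cast< double >( vfs.f_bfree );

    return 100.0 * ( total - unused ) / total;
}
```

Calling usage_percent( "/tmp" ) periodically and logging the peaks would give the baseline above which the size= limit in /etc/vfstab should be set.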

Wednesday, April 2, 2014

Back from 2014 vacation

Well, after two months, one of which dedicated to my vacation, I'm back.
Not that anyone cares, of course, but just to make sure I'll carry on.
Having this couple of months away was of great benefit to me.
On the other hand, there's a backlog I intend to attack:

A refined version of a 1st specialized C++ memory pool
Advanced C++ smart pointers
Further posts about Mercurial installation
Further posts about DNS installation
Further posts about SMF

Not that everything will flow quickly or soon. After all, every honest accomplishment almost always requires hard work, and I'm glad this is the path I've always been looking for.

Saturday, September 21, 2013

Advanced C++ smart pointers

All the development for the simple C++ smart pointers is good for many scenarios, but so far I haven't taken advantage of so-called reference counting. That's the main reason why I had to use the helper relay objects across function calls and why I couldn't consider the STL containers, not to mention multi-threading. The critical role of the helper relay objects may change to passing the pointers across multiple threads (of the same process, of course!).

In what follows, I intend to implement reference counting, because it's quite useful in spite of the added overhead. Furthermore, I believe that most implementations, such as Boost's shared_ptr (whose name I don't like for being misleading: what's shared is the pointed-to object, not even the ownership), aren't appropriately implemented for Solaris.

For the dynamically allocated multi-threaded reference counter I'll use my specialized memory pool which I believe is flexible and efficient enough to get the job done.

...

Tuesday, September 10, 2013

A 1st specialized C++ memory pool

Specialized C++ memory pools have many important applications. They are not C++ memory allocators, but can be used in their implementation. Solaris provides quite good and rich support for this, which I'll try to take advantage of.

Perhaps the most obvious application is for nodes of various data structures. Nevertheless, as a particular example of how useful a specialized memory pool is, consider the strategy of reference counting, which is especially useful to smart pointers. The fact is that the counter must be shared by the set of smart pointers pointing to the same object. As I said in another post, this has induced the misnaming of Boost's shared_ptr. But back to the subject, the counter requirement implies that it must be dynamically allocated. The known problem is that using the standard operators ::new and ::delete is quite inefficient, especially for an intended industrial-strength version. What's needed is a replacement, such as the slab allocator by Jeff Bonwick or Bjarne Stroustrup's User-Defined Allocator (The C++ Programming Language Third/Special, section 19.4.2), both based on the idea of caches for efficient constant-time, O(1), operations.

Furthermore, it should be thread-safe, preferably with non-locking atomic arithmetic. I'll see if it's possible to avoid mutexes, and I intend to use the atomic operations provided by the atomic_ops(3C) collection of functions in the Solaris Standard C Library, as indicated by Darryl Gove in his book Solaris Application Programming, section 12.7.4, Using Atomic Operations. In fact, in Multicore Application Programming: For Windows, Linux, and Oracle® Solaris, chapter 8, listings 8.4 and 8.5, also by Darryl Gove, I may have the solution: instead of mutexes, use a lock-free variant that loops on the CAS of a local variable.
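To make the "loop on the CAS of a local variable" idea concrete, here is a minimal sketch. Note that it is a portable stand-in and not the code from the cited listings: it assumes a C++11 compiler and uses std::atomic instead of atomic_ops(3C), and the name cas_add is made up for illustration.

```cpp
#include <atomic>

// Lock-free add: retry the compare-and-swap until no other thread
// has modified the counter between our read and our write.
inline long cas_add( std::atomic< long > & counter, long delta )
{
    // The local variable holds the value we believe is current.
    long seen = counter.load();

    // On failure compare_exchange_weak refreshes 'seen' with the
    // current value, so the loop simply retries with fresh data.
    while ( ! counter.compare_exchange_weak( seen, seen + delta ) )
        ;

    return seen + delta;
}
```

No thread ever blocks holding a lock here; a thread that loses the race just repeats the loop with the updated local copy.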

So, inspired by the above references, I'll start my implementation of a thread-safe and always expanding specialized pool. Internally, it will comprise several chunks. My intention is to make the size of each chunk fit a certain page size, which ultimately determines how many object slots each chunk will be able to cache. The first hurdle is that this is dynamic, depending on the hardware and the multiple page sizes it supports, among which to choose. As such, internally, I can't declare an array of objects (whose size must be known at compile time), nor can I declare a pointer to objects (as this would decouple the list of chunks from the chunks themselves, incurring separate dynamic memory management, one for the list and another for the chunks).
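The relation between the page size and the number of slots per chunk is simple arithmetic, sketched below. The function name slots_per_chunk and its parameters are hypothetical; the header size stands for whatever bookkeeping each chunk keeps at its start.

```cpp
#include <cstddef>

// Number of object slots that fit in one chunk after its header.
// The page size is only known at run time, hence a plain function.
inline std::size_t slots_per_chunk( std::size_t page_size,
                                    std::size_t header_size,
                                    std::size_t slot_size )
{
    if ( slot_size == 0 || header_size >= page_size )
        return 0;

    return ( page_size - header_size ) / slot_size;
}
```

For example, a 4096-byte page with an 8-byte header and 8-byte slots holds 511 slots, while a 2 MB large page with the same header holds 262143.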

The initial (probably incomplete) version of my constant time, O(1), thread-safe, always expanding pool, has two fundamental high-level operations, request() and recycle(), in order to make it publicly useful.

The implementation idea is somewhat simple: Maintain a free list of data slots over the unused payload areas of the buffers comprising the cache.
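Before the concurrent version, the bare idea can be shown with a deliberately simplified sketch: single-threaded, one fixed buffer, no expansion. The class name tiny_pool is made up, and the sketch assumes N is greater than zero and that T needs no stricter alignment than a pointer.

```cpp
#include <cstddef>

// A single-threaded free list laid over one fixed buffer.
// Each unused slot stores the pointer to the next unused slot.
// Assumptions: N > 0 and T is no more strictly aligned than a pointer.
template< typename T, std::size_t N >
class tiny_pool
{
    union slot
    {
        unsigned char data[ sizeof( T ) ];
        slot * next;
    };

    slot slots[ N ];
    slot * available;

public:

    tiny_pool() : available( slots )
    {
        // Thread every slot onto the free list.
        for ( std::size_t i = 0; i + 1 < N; ++i )
            slots[ i ].next = & slots[ i + 1 ];

        slots[ N - 1 ].next = 0;
    }

    // Pop the head of the free list; a null pointer means exhausted.
    void * request()
    {
        slot * const s = available;

        if ( s )
            available = s->next;

        return s;
    }

    // Push the slot back onto the head of the free list.
    void recycle( void * p )
    {
        slot * const s = static_cast< slot * >( p );

        s->next = available;
        available = s;
    }
};
```

Both operations touch only the head of the list, which is what makes them constant time. The real pool replaces the plain assignments to the head with CAS loops and adds new buffers instead of returning a null pointer.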

But achieving this at the code level isn't as simple because of the performance, space and concurrency constraints. One notable trade-off comes from the free list pointers sharing the space of unused data slots, which implies that the minimum size of a data slot is the size of a pointer (currently 8 bytes on 64-bit Intel platforms). Thus, for instance, if the data type is int (currently 4 bytes on 64-bit Intel platforms), then 50% of the space will be wasted. Such waste is known as internal fragmentation. Hence, in terms of space efficiency:

It's best to have the size of the main (pointed to) data type (T)
as a multiple of the platform's pointer (void *) size.
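The waste per slot can be measured directly from the slot layout. The sketch below mirrors the union used by the pool; the names slot_of and wasted_bytes are only illustrative.

```cpp
#include <cstddef>

// Same layout as a pool slot: payload bytes overlaid with a link.
template< typename T >
union slot_of
{
    unsigned char data[ sizeof( T ) ];
    slot_of * next;
};

// Bytes lost per slot to internal fragmentation.
template< typename T >
inline std::size_t wasted_bytes()
{
    return sizeof( slot_of< T > ) - sizeof( T );
}
```

On a typical 64-bit platform, where int takes 4 bytes and a pointer takes 8, wasted_bytes< int >() is 4, which is the 50% mentioned above, while wasted_bytes< void * >() is 0.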

Here's my 1st implementation attempt:

#include <new>
#include <cstddef>
#include <stdexcept>
#include <cstdlib>
#include <cerrno>

#include <alloca.h>
#include <atomic.h>
#include <unistd.h>
#include <sys/shm.h>
#include <sys/mman.h>
#include <sys/types.h>

namespace memory
{

struct bad_alloc
{
 bad_alloc( int const error ) throw() :
 error( error )
 {
 }

 int const error;
};

namespace policy
{

enum { locked, ism, pageable, dism };

namespace internal
{

// Base class for policy classes of memory management.
struct control
{
 control( std::size_t const page_size ) throw() :
 page_size( page_size )
 {
 }

 void hat_advise_va( void const * p ) const throw()
 {
 ::memcntl_mha mha;

 mha.mha_cmd = MHA_MAPSIZE_VA;
 mha.mha_flags = 0;
 mha.mha_pagesize = page_size;

 if
 (
 ::memcntl
 (
 static_cast< caddr_t >
 ( const_cast< void * >( p ) ),
 page_size,
 MC_HAT_ADVISE,
 reinterpret_cast< caddr_t >( & mha ),
 0,
 0
 )
 != 0
 )
 {
 // Log the error.
 }
 }

 void lock( void const * p ) const throw()
 {
 if ( ::mlock( p, page_size ) != 0 )
 {
 // Log the error.
 }
 }

 void unlock( void const * p ) const throw()
 {
 if ( ::munlock( p, page_size ) != 0 )
 {
 // Log the error.
 }
 }

 // The runtime size of buffers.
 std::size_t const page_size;
};

// Policy class for low-level shared memory management.
// Use non-type template parameters for additional data.
// Template functions to typecast data members.
template< int F >
struct shared : control
{
 shared( std::size_t const page_size ) throw() :
 control( page_size )
 {
 }

 template< typename B >
 void * request() const throw( std::bad_alloc )
 {
 int const handle =
 ::shmget( IPC_PRIVATE, page_size, SHM_R | SHM_W );

 if ( handle == -1 )
 {
 // Log the error.

 throw std::bad_alloc();
 }

 void * p = ::shmat( handle, 0, F );

 // shmat() signals failure with (void *) -1, not a null pointer.
 if ( p == reinterpret_cast< void * >( -1 ) )
 {
 // Log the error.

 if ( ::shmctl( handle, IPC_RMID, 0 ) != 0 )
 {
 // Log the error.
 }

 throw std::bad_alloc();
 }

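 // Record the handle in the buffer's header so that recycle()
 // can find it later (B is the buffer type).
 * const_cast< int * >
 ( & reinterpret_cast< B * >( p )->handle ) =
 handle;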

 return p;
 }

 template< typename B >
 void recycle( B const * const p ) const throw()
 {
 int const handle = p->handle;

 if ( ::shmdt( ( void * ) p ) != 0 )
 {
 // Log the error.
 }

 if ( ::shmctl( handle, IPC_RMID, 0 ) != 0 )
 {
 // Log the error.
 }
 }
};

} // namespace internal

// Policy class for low-level C++ memory management.
struct cxx : internal::control
{
 cxx( std::size_t const page_size ) throw() :
 internal::control( page_size )
 {
 }

 template< typename >
 void * request() const throw( std::bad_alloc )
 {
 // Unfortunately, in general, not aligned!
 // No point for locking or setting page size.
 // Unfortunately, all bets are off!
 return ::operator new ( page_size );
 }

 template< typename B >
 void recycle( B const * const p ) const throw()
 {
 ::operator delete ( ( void * ) p );
 }
};

// Template policy class for low-level C memory management.
template< int >
struct c;

template<>
struct c< pageable > : internal::control
{
 c( std::size_t const page_size ) throw() :
 internal::control( page_size )
 {
 }

 template< typename >
 void * request() const throw( std::bad_alloc )
 {
 // Solaris Standard C library to the rescue!
 void * p = ::memalign( page_size, page_size );

 if ( ! p )
 throw std::bad_alloc();

 // Advise HAT to adopt a corresponding page size.
 hat_advise_va( p );

 return p;
 }

 template< typename B >
 void recycle( B const * const p ) const throw()
 {
 ::free( ( void * ) p );
 }
};

template<>
struct c< locked > : c< pageable >
{
 c( std::size_t const page_size ) throw() :
 c< pageable >( page_size )
 {
 }

 template< typename B >
 void * request() const throw( std::bad_alloc )
 {
 void * p = c< pageable >::request< B >();

 lock( p );

 return p;
 }

 template< typename B >
 void recycle( B const * const p ) const throw()
 {
 unlock( ( void * ) p );

 c< pageable >::recycle( p );
 }
};

template< int >
struct shared;

template<>
struct shared< ism > : internal::shared< SHM_SHARE_MMU >
{
 shared( std::size_t const page_size ) throw() :
 internal::shared< SHM_SHARE_MMU >( page_size )
 {
 }
};

template<>
struct shared< dism > : internal::shared< SHM_PAGEABLE >
{
 shared( std::size_t const page_size ) throw() :
 internal::shared< SHM_PAGEABLE >( page_size )
 {
 }
};

template<>
struct shared< pageable > : internal::shared< SHM_RND >
{
 shared( std::size_t const page_size ) throw() :
 internal::shared< SHM_RND >( page_size )
 {
 }

 template< typename B >
 void * request() const throw( std::bad_alloc )
 {
 void * p = internal::shared< SHM_RND >::request< B >();

 // Advise HAT to adopt a corresponding page size.
 hat_advise_va( p );

 return p;
 }
};

template<>
struct shared< locked > : shared< pageable >
{
 shared( std::size_t const page_size ) throw() :
 shared< pageable >( page_size )
 {
 }

 template< typename B >
 void * request() const throw( std::bad_alloc )
 {
 void * p = shared< pageable >::request< B >();

 lock( p );

 return p;
 }

 template< typename B >
 void recycle( B const * const p ) const throw()
 {
 unlock( ( void * ) p );

 shared< pageable >::recycle( p );
 }
};

} // namespace policy

namespace internal
{

// Template for basic-memory based buffers.
template< typename T, typename >
struct buffer
{
 buffer( buffer const * const p ) throw() :
 next( p )
 {
 }

 union
 {
 // The next buffer on the list.
 buffer const * const next;

 // Alignment enforcement for the payload.
 T * align;
 };

 // The cached objects reside beyond this offset.
 // A trick to keep everything within the same chunk.

private:

 // Just allow placement new and explicit destruction.

 static void * operator new ( std::size_t ) throw();
 static void operator delete ( void * ) throw();

 static void * operator new [] ( std::size_t ) throw();
 static void operator delete [] ( void * ) throw();

 buffer( buffer const & );
 buffer & operator = ( buffer const & );
};

// Partial specialization for shared-memory based buffers.
template< typename T, int S >
struct buffer< T, policy::shared< S > >
{
 buffer( buffer const * const p ) throw() :
 handle( handle ), next( p )
 {
 }

 // The shared memory associated handle.
 //
 // WATCH OUT!
 // This will be set in placement new
 // even before the constructor is called!
 //
 int const handle;

 union
 {
 // The next buffer on the list.
 buffer const * const next;

 // Alignment enforcement for the payload.
 T * align;
 };

 // The cached objects reside beyond this offset.
 // A trick to keep everything within the same chunk.

private:

 // Just allow placement new and explicit destruction.

 static void * operator new ( std::size_t ) throw();
 static void operator delete ( void * ) throw();

 static void * operator new [] ( std::size_t ) throw();
 static void operator delete [] ( void * ) throw();

 buffer( buffer const & );
 buffer & operator = ( buffer const & );
};

} // namespace internal

template< typename >
struct strategy
{
 enum { shared = false };
};

template< int S >
struct strategy< policy::shared< S > >
{
 enum { shared = true };
};

template< typename T >
inline T * tmp_array( std::size_t const n ) throw()
{
 return static_cast< T * >( ::alloca( n * sizeof( T ) ) );
}

inline std::size_t largest_page() throw()
{
 std::size_t largest = ::sysconf( _SC_PAGESIZE );

 int n = ::getpagesizes( NULL, 0 );
 std::size_t * size = tmp_array< std::size_t >( n );

 if ( ::getpagesizes( size, n ) != -1 )
 while ( --n >= 0 )
 if ( size[ n ] > largest )
 largest = size[ n ];

 return largest;
}

// The specialized memory pool of T objects.
template
<
typename T,
 typename A = policy::c< policy::pageable >
>
struct pool
{
 // Do not pre-allocate anything.
 // This provides very fast construction.
 pool
 (
 std::size_t const page_size =
 strategy< A >::shared
 ? largest_page()
 : ::sysconf( _SC_PAGESIZE )
 )
 throw() :
 allocator( page_size ),
 segment( 0 ),
 expanding( 0 ),
 available( 0 )
 {
 }

 ~pool() throw()
 {
 // An iterative instead of a recursive deleter.
 // This assures no stack overflow will ever happen here.
 while ( segment )
 {
 buffer const * const p = segment;

 segment = segment->next;

 p->~buffer();

 allocator.recycle( p );
 }
 }

 // The function expand() can be delayed as much as desired.
 // It will be automatically called if absolutely necessary.
 // One thread will do the expand and others will wait.
 void expand() throw( bad_alloc )
 {
 // Serialize and minimize concurrent expansions.
 if
 (
 ::atomic_cas_ptr( & expanding, 0, ( void * ) 1 )
 ==
 0
 )
 {
 // The modifying thread attempts the expansion.
 // Blocked threads will get notified at end.
 try
 {
 allocate();

 // Release other threads.
 ::atomic_swap_ptr( & expanding, 0 );
 }

 catch ( std::bad_alloc const & )
 {
 // Release other threads.
 ::atomic_swap_ptr( & expanding, 0 );

 // Notifies itself about the exception.
 throw bad_alloc( ENOMEM );
 }
 }
 else
 // Wait on loop before resuming.
 // Better than throwing exceptions.
 while
 (
 ::atomic_cas_ptr( & expanding, 0, 0 )
 ==
 ( void * ) 1
 )
 ;
 }

 void * request() throw( bad_alloc )
 {
 start:

 try
 {
 slot * a;

 do
 {
 if ( ! ( a = available ) )
 throw bad_alloc( ENOMEM );
 }
 while
 (
 ::atomic_cas_ptr( & available, a, a->next ) != a
 );

 return a;
 }

 catch ( bad_alloc const & )
 {
 try
 {
 // Race for expansion.
 expand();
 }

 catch ( ... )
 {
 // Out of memory.
 throw;
 }
 }

 // The previous expansion succeeded.
 // Try again to fulfill the request for a slot.
 goto start;
 }

 void recycle( void * p ) throw()
 {
 slot * a;

 do
 {
 a = available;
 reinterpret_cast< slot * >( p )->next = a;
 }
 while ( ::atomic_cas_ptr( & available, a, p ) != a );
 }

private:

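 // The low-level allocator reports failure with std::bad_alloc,
 // which expand() catches and translates.
 void allocate() throw( std::bad_alloc )
 {
 segment =
 ::new ( allocator.template request< buffer >() )
 buffer( segment );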

 // Skip the buffer's prefix.

 slot * const p =
 reinterpret_cast< slot * >
 (
 reinterpret_cast< intptr_t >( segment )
 +
 sizeof( buffer )
 );

 // Add new slots from the new buffer's payload.
// Is it worthy to unroll (parallelize) the loop?

 slot const * const limit =
p
 +
 ( allocator.page_size - sizeof( buffer ) )
 /
 sizeof( slot );

 slot * tail = p;
slot * tracker = tail++;

 while ( tail < limit )
 {
 tracker->next = tail;
tracker = tail++;
 }

 (--tail)->next = 0;

 // Prepend the new slots.

 slot * a;

 do
 {
 a = available;
 tail->next = a;
 }
 while ( ::atomic_cas_ptr( & available, a, p ) != a );
 }

private:

 // The low-level (OS) memory allocator.
 A const allocator;

 // Convenience.
typedef internal::buffer< T, A > buffer;

 // The list of buffers.
// Each node contains the slots of data.
 buffer const * segment;

 // Expansion serialization control.
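 // Pointer-sized, because atomic_cas_ptr() and atomic_swap_ptr()
 // operate on pointer-sized objects.
 void * volatile expanding;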

 // The slots of data (T objects)
 // and list of available (free) slots.
 // Reusing free slots for the list's pointers.
 union slot
 {
 // Just the T size is needed (for space reservation).
 // Avoid further dependencies around the T type.
 unsigned char data[ sizeof( T ) ];

 slot * next;
 }
 * volatile available;

private:

 pool( pool const & );
 pool & operator = ( pool const & );
};

// A base class for a very convenient integration
// with the standard C++ memory management operators.
template
<
 typename T,
 typename A = policy::c< policy::pageable >
>
struct operations
{
 // request() may throw, so an empty throw() would be wrong here.
 static void * operator new ( std::size_t ) throw( bad_alloc )
 {
 return pool.request();
 }

 static void operator delete ( void * p ) throw()
 {
 pool.recycle( p );
 }

 // The pool must be common (static).
 static memory::pool< T, A > pool;
};

// The multiple translation unit template merging
// avoids manually defining their static declarations.
template< typename T, typename A >
pool< T, A > operations< T, A >::pool;

} // namespace memory

Next are a few raw examples just for illustration. More realistically, the pool would be internal to some other object such as an advanced C++ smart pointer in order to provide more efficient and multi-threaded allocations and deallocations.

Example 1:

template< typename T >
struct pointer
{
 ...

// All instances must share the same pool.
 static memory::pool< std::size_t > pool;

...
};

// The shared pool.
template< typename T >
memory::pool< std::size_t > pointer< T >::pool;

Example 2:

void f()
{
// An ISM cache.
static memory::pool
 <
 std::size_t,
 memory::policy::shared< memory::policy::ism >
 >
 pool;

...
}

Example 3:

// On Intel x86-64, sizeof( S ) = 8 = sizeof( void * )
// So, there's no internal fragmentation.
struct S
{
 char c;
 int i;
};

void f()
{
 memory::pool
 <
 S,
 memory::policy::c< memory::policy::locked >
 >
 pool( memory::largest_page() );

 try
 {
// Manual pre-expand (just for illustration).
 pool.expand();

 S * p = ( S * ) pool.request();

 p->c = 'S';
 p->i = 11;

 pool.recycle( p );
 }

 catch ( ... )
 {
 ...
 }
}

Example 4:

#ifndef S_HXX
#define S_HXX

#include <string.h>

#include "memory.hxx"

struct S : memory::operations< S >
{
 int code;
 char text[ 10 ];

 S( int c = 0, char const * t = "" ) : code( c )
 {
 // strlcpy() takes the full buffer size and always NUL-terminates.
 ::strlcpy( text, t, sizeof( text ) );
 }
};

#endif /* S_HXX */

#include "s.hxx"

void f()
{
    S * s = new S;

    ...

    delete s;
}

#include "s.hxx"

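// Caution: derived_S inherits S's operator new, which hands out
// slots sized for S, so derived_S must not be larger than S.
struct derived_S : S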
{
   ...
};

extern void f();

void g()
{
    S * s = new S;

...

    delete s;

...

    derived_S * ds = new derived_S;

...

    delete ds;

...

    f();
}