Linux kernel version 5.1 introduces support for the volatile use of persistent memory as a hot-plugged memory region (KMEM DAX). When this feature is enabled, persistent memory is exposed as one or more separate, memory-only NUMA nodes. The libmemkind API was extended with new kinds that allow automatic detection of, and allocation from, these persistent memory NUMA nodes.
1. Kernel 5.1 or later with the KMEM DAX driver enabled. If the KMEM DAX driver is not enabled in your kernel, you will have to reconfigure the kernel (for example, by running make nconfig), enable the driver (the CONFIG_DEV_DAX_KMEM option), and rebuild.
2. ndctl and daxctl version 66 or later. Reconfiguration of Device-DAX depends on the dax-bus device model, so the kernel has to support the /sys/bus/dax model. To migrate from the /sys/class/dax model to the /sys/bus/dax model, use daxctl-migrate-device-model.
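Both prerequisites can be checked from the shell; a quick sketch (the kernel config file path is distribution-specific):

    $ grep CONFIG_DEV_DAX_KMEM /boot/config-$(uname -r)    # expect =y or =m
    $ daxctl migrate-device-model                          # opt in to the /sys/bus/dax device model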
The list of available NUMA nodes on the system can be retrieved using numactl. An example of the initial configuration is presented below:
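(Illustrative; the number of nodes, CPU lists, and memory sizes depend on the platform.)

    $ numactl --hardware
    available: 2 nodes (0-1)
    node 0 cpus: ...
    node 0 size: ...
    node 1 cpus: ...
    node 1 size: ...
    ...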
To create a namespace in Device-DAX mode, to be used as standard memory, from all of the available NVDIMM capacity:
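    $ ndctl create-namespace --mode=devdax --map=mem    # one possible invocation; --map=mem keeps the page metadata in DRAM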
To list DAX devices:
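    $ daxctl list

The device name reported in the output (for example, dax0.0) is needed for the reconfiguration step below.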
To reconfigure a DAX device from devdax mode to system-ram mode:
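    $ daxctl reconfigure-device dax0.0 --mode=system-ram    # assuming the device reported above is dax0.0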
After this operation, persistent memory is configured as a separate NUMA node and can be used as volatile memory. In the example configuration below, the persistent memory NUMA node is Node 3 (a NUMA node without any assigned CPUs):
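(Illustrative output; everything except the CPU-less persistent memory node is elided.)

    $ numactl --hardware
    ...
    node 3 cpus:
    node 3 size: ...
    ...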
Libmemkind supports the KMEM DAX option in three variants. The memory policy of each variant is illustrated by the animations below. The example configuration on which the animations are based is as follows:
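(An illustrative numactl view of that configuration, derived from the description below; CPU lists and memory sizes are placeholders.)

    $ numactl --hardware
    available: 4 nodes (0-3)
    node 0 cpus: 0 ...
    node 0 size: ...
    node 1 cpus: ...
    node 1 size: ...
    node 2 cpus:
    node 2 size: ...
    node 3 cpus:
    node 3 size: ...
    ...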
This corresponds to a system with two DRAM NUMA nodes (Node 0 and Node 1, each with CPUs assigned) and two persistent memory NUMA nodes (Node 2 and Node 3, without CPUs), where Node 2 is the persistent memory node closest to Node 0.
Note: in all three scenarios below, the application is assumed to run only on CPU 0, which belongs to Node 0.
MEMKIND_DAX_KMEM: This is the first variant where memory comes from the closest persistent memory NUMA node at the time of allocation.
The process runs only on CPU 0, which is assigned to Node 0. Node 2 is the closest persistent memory NUMA node to Node 0; therefore, deploying MEMKIND_DAX_KMEM results in taking memory only from Node 2. If there is not enough free memory in Node 2 to satisfy an allocation request, inactive pages from Node 2 are moved into swap space, freeing up memory for the MEMKIND_DAX_KMEM use case.
MEMKIND_DAX_KMEM_ALL: This is the second variant, where memory can come from any persistent memory NUMA node available in the system at the time of allocation.
The situation is similar to the MEMKIND_DAX_KMEM scenario, except that when there is not enough free memory in Node 2 to satisfy an allocation request, the allocation switches to Node 3. When the available space in Node 3 is also exhausted, swap space is used.
MEMKIND_DAX_KMEM_PREFERRED: This is the third variant, where memory comes from the closest persistent memory NUMA node at the time of allocation. If there is not enough memory on that node to satisfy a request, the allocation falls back to other memory NUMA nodes.
Again, the allocation starts from Node 2. When there is not enough free memory in Node 2, the allocation switches to the other nodes in order of increasing distance from the preferred node, based on information provided by the platform firmware: first Node 0, then Node 3, and finally Node 1. When the available space in Node 1 is also exhausted, swap space is used.
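In code, the three variants differ only in the kind passed to memkind_malloc(); a minimal sketch (error handling omitted, and assuming a libmemkind build that provides the DAX KMEM kinds):

    #include <memkind.h>

    int main(void)
    {
        /* Closest persistent memory NUMA node only. */
        void *a = memkind_malloc(MEMKIND_DAX_KMEM, 1024);

        /* Any persistent memory NUMA node in the system. */
        void *b = memkind_malloc(MEMKIND_DAX_KMEM_ALL, 1024);

        /* Closest persistent memory NUMA node, with fallback to other nodes. */
        void *c = memkind_malloc(MEMKIND_DAX_KMEM_PREFERRED, 1024);

        memkind_free(MEMKIND_DAX_KMEM, a);
        memkind_free(MEMKIND_DAX_KMEM_ALL, b);
        memkind_free(MEMKIND_DAX_KMEM_PREFERRED, c);
        return 0;
    }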
Caveats
For MEMKIND_DAX_KMEM_PREFERRED, the allocation will not succeed if two or more persistent memory NUMA nodes are at the same shortest distance from the same CPU on which the process is eligible to run. This eligibility check is performed when the application starts.
MEMKIND_DAX_KMEM_NODES is an environment variable: a comma-separated list of NUMA nodes that are treated as persistent memory. It can be used to override the automatic detection of the closest persistent memory NUMA nodes.
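For example, to treat NUMA nodes 2 and 3 as persistent memory for a hypothetical application binary ./app:

    $ MEMKIND_DAX_KMEM_NODES=2,3 ./app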
Leveraging the new kinds is similar to the usage presented in the previous post, with one exception: the kinds presented in this post do not need to be created first with the memkind_create_pmem() function.
With libmemkind, it is also possible to distinguish the physical placement of an allocation.
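One way to do this is to query the kernel for the NUMA node backing the allocated pages with get_mempolicy() from libnuma; a sketch (link with -lmemkind -lnuma; on a correctly configured system the reported node is expected to be a persistent memory node, such as Node 2 in the example above):

    #include <memkind.h>
    #include <numaif.h>     /* get_mempolicy() */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        size_t size = 1024 * 1024;

        /* Allocate from the closest persistent memory NUMA node;
         * the kind is static, so no memkind_create_pmem() call is needed. */
        char *buf = memkind_malloc(MEMKIND_DAX_KMEM, size);
        if (buf == NULL) {
            fprintf(stderr, "memkind_malloc failed\n");
            return 1;
        }

        /* Touch the memory so the pages are actually faulted in. */
        memset(buf, 0, size);

        /* Ask which NUMA node backs the first page of the allocation. */
        int node = -1;
        if (get_mempolicy(&node, NULL, 0, buf, MPOL_F_NODE | MPOL_F_ADDR) == 0) {
            printf("allocation resides on NUMA node %d\n", node);
        }

        memkind_free(MEMKIND_DAX_KMEM, buf);
        return 0;
    }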