The sources of the pmemobjfs file system are available here. Please refer to the README file for instructions on how to create a file system layout and mount it.
NOTE: This is just an example implementation of a file system in user space using the libpmemobj library, and it is not considered to be production quality. Please do not use this file system to store data you care about, because it may be lost.
The definition of the libpmemobj layout looks like this:
typedef uint8_t objfs_block_t;
POBJ_LAYOUT_BEGIN(pmemobjfs);
POBJ_LAYOUT_ROOT(pmemobjfs, struct objfs_super);
POBJ_LAYOUT_TOID(pmemobjfs, struct objfs_inode);
POBJ_LAYOUT_TOID(pmemobjfs, struct objfs_dir_entry);
POBJ_LAYOUT_TOID(pmemobjfs, objfs_block_t);
POBJ_LAYOUT_TOID(pmemobjfs, char);
POBJ_LAYOUT_END(pmemobjfs);
It consists of a root object and four typed OIDs. The objfs_block_t type is a typedef for uint8_t, introduced in order to bind a unique type number to this data structure. The typed OID for char is required in order to allocate a fixed-length string from the pmemobj pool. The rest of the data structures are described in detail in the following sections.
The main data structure of pmemobjfs is struct objfs_super, which plays the role of the superblock in traditional file systems:
struct objfs_super {
	TOID(struct objfs_inode) root_inode; /* root dir inode */
	TOID(struct tree_map) opened; /* map of opened files / dirs */
	uint64_t block_size; /* size of data block */
};
The root_inode field holds the inode object of the root directory, which is created when the file system layout is created.
The block_size field holds the size of the data blocks in which file contents and directory entries are stored.
The opened field is a tree map of opened inodes. This map is required for handling the unlink operation on opened files.
The next important data structure used by pmemobjfs is struct objfs_inode, which represents a file system object.
struct objfs_inode {
	uint64_t size; /* size of file */
	uint64_t flags; /* file flags */
	uint64_t dev; /* device info */
	uint32_t ctime; /* time of last status change */
	uint32_t mtime; /* time of last modification */
	uint32_t atime; /* time of last access */
	uint32_t uid; /* user ID */
	uint32_t gid; /* group ID */
	uint32_t ref; /* reference counter */
	union {
		struct objfs_file file; /* file specific data */
		struct objfs_dir dir; /* directory specific data */
		struct objfs_symlink symlink; /* symlink specific data */
	} d;
};
It contains the basic attributes of an object: size, flags, device information, timestamps (status change, modification and access), owner user and group IDs, and a reference counter.
The inode may represent a file, a directory or a symbolic link. It contains a separate structure for each inode type, which holds the essential information specific to that type.
The data specific to a directory object contains a doubly-linked list of directory entries.
struct objfs_dir {
	PDLL_HEAD(struct objfs_dir_entry) entries; /* directory entries */
};
The data specific to a file object contains a tree map of blocks. The map key consists of the block number and the value contains a PMEMoid of the data block.
struct objfs_file {
	TOID(struct tree_map) blocks; /* blocks map */
};
The data specific to a symbolic link contains the length of the link and the link data.
struct objfs_symlink {
	uint64_t len; /* length of symbolic link */
	TOID(char) name; /* symbolic link data */
};
The struct objfs_dir_entry represents a directory entry. It contains persistent pointers to its neighbours, a pointer to the corresponding inode and a name:
struct objfs_dir_entry {
	PDLL_ENTRY(struct objfs_dir_entry) pdll; /* list entry */
	TOID(struct objfs_inode) inode; /* pointer to inode */
	char name[]; /* name */
};
The maximum length of the name of a directory entry is limited by the block size specified when creating the file system. It is equal to block_size - sizeof (struct objfs_dir_entry).
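As a rough illustration of that limit, the sketch below computes it with stand-in types. Everything in it is hypothetical: struct oid_model imitates a persistent pointer as two 64-bit fields, and the list entry is assumed to consist of one next and one prev pointer, so the real structure size (and therefore the real limit) may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a persistent pointer: two 64-bit fields (assumed). */
struct oid_model {
	uint64_t pool_uuid_lo;
	uint64_t off;
};

static_assert(sizeof(struct oid_model) == 16, "two 64-bit fields");

/* Stand-in for struct objfs_dir_entry; the list entry layout is assumed. */
struct dir_entry_model {
	struct oid_model next;  /* assumed list entry field */
	struct oid_model prev;  /* assumed list entry field */
	struct oid_model inode; /* pointer to inode */
	char name[];            /* name, stored in the rest of the block */
};

/* Bytes left for the name in one block, or 0 if the header does not fit. */
size_t
max_name_len(size_t block_size)
{
	if (block_size <= sizeof(struct dir_entry_model))
		return 0;
	return block_size - sizeof(struct dir_entry_model);
}
```

Under these assumptions the header takes 48 bytes, so a block of 512 - 64 bytes would leave 400 bytes for the name.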
All operations which modify the file system structure are performed within a transaction, which protects the pmemobjfs layout from being corrupted if a power failure occurs during any operation.
In this section I would like to describe in detail some of the most important operations performed on the file system.
NOTE: In the current implementation it is recommended to mount pmemobjfs with the -s option. In this case FUSE works in single-threaded mode and there is no need for synchronization mechanisms.
To create the pmemobjfs layout you can use the mkfs.pmemobjfs command:
mkfs.pmemobjfs -s <size> -b <block size> /mnt/pmem/pmemobjfs.obj
By default it creates a file system layout with the minimal size required for a pmemobj pool and with the block size equal to 512 - 64. This default block size is chosen in order to minimize the internal fragmentation of allocated blocks: in the current implementation the allocation and out-of-band headers are kept in one cache line before the allocation. Although the default value is chosen with respect to the internal layout of the pmemobj pool, you do not need to keep that in mind when creating the file system. An arbitrary value specified for the block size is valid and pmemobjfs will work properly.
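The reasoning behind 512 - 64 can be sketched with a deliberately simplified model. It assumes that every block is preceded by 64 bytes of headers (one cache line, as described above) and that the allocator rounds the total up to a multiple of some unit. The rounding unit is a hypothetical parameter of this model, not a documented property of libpmemobj.

```c
#include <assert.h>
#include <stdint.h>

#define HEADER_SIZE 64u /* one cache line of headers before the block */

/*
 * Total footprint of one block when header + payload is rounded up to a
 * multiple of unit. The rounding unit is an assumption of this model.
 */
uint64_t
block_footprint(uint64_t block_size, uint64_t unit)
{
	assert(unit > 0);
	uint64_t raw = block_size + HEADER_SIZE;
	return (raw + unit - 1) / unit * unit;
}

/* Bytes lost to rounding for one block under the same model. */
uint64_t
block_waste(uint64_t block_size, uint64_t unit)
{
	return block_footprint(block_size, unit) - (block_size + HEADER_SIZE);
}
```

With a hypothetical 512-byte unit, a block of 512 - 64 bytes wastes nothing, while a 512-byte block spills into a second unit and wastes 448 bytes.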
The file system layout is created within a transaction. The following listing shows the most important parts of the routine for creating the pmemobjfs layout:
...
objfs->pop = pmemobj_create(fname,
		POBJ_LAYOUT_NAME(pmemobjfs), size, mode);
...
TOID(struct objfs_super) super = POBJ_ROOT(objfs->pop, struct objfs_super);
...
TX_BEGIN(objfs->pop) {
	TX_ADD(super);
	/* create an opened files map */
	tree_map_new(objfs->pop, &D_RW(super)->opened);
	/* create root inode, inherit uid and gid from current user */
	D_RW(super)->root_inode =
		pmemobjfs_new_dir(objfs, TOID_NULL(struct objfs_inode),
				"/", root_flags, uid, gid);
	D_RW(super)->block_size = bsize;
} TX_ONABORT {
	fprintf(stderr, "error: creating pmemobjfs aborted\n");
	ret = (-ECANCELED);
} TX_END
...
pmemobj_close(objfs->pop);
...
At the beginning the pmemobj pool is created with the specified layout name, size and mode. Next the root object is allocated when the POBJ_ROOT macro is called for the first time. According to the documentation we can be sure the root object is zeroed. Next the root object is initialized within a transaction: the tree map for opened inodes is created, the root inode is created and the block size is stored. Because all operations are performed within the transaction, we can be sure that the root object will either be filled in entirely or not at all. At the very end the pmemobj pool is closed, and as a result we have an initialized pmemobjfs file system layout.
The following listing presents the most important operations performed when creating a new directory on a pmemobjfs file system:
...
TX_BEGIN(objfs->pop) {
	TOID(struct objfs_inode) new_inode =
		pmemobjfs_new_dir(objfs, inode, name, flags, uid, gid);
	TOID(struct objfs_dir_entry) entry =
		pmemobjfs_dir_entry_alloc(objfs, name, new_inode);
	pmemobjfs_add_dir_entry(objfs, inode, entry);
	TX_ADD_FIELD(inode, mtime);
	D_RW(inode)->mtime = time(NULL);
} TX_ONABORT {
	ret = (-ECANCELED);
} TX_END
...
After beginning a new transaction, the new directory is allocated and initialized. After creating the inode for the new directory, the struct objfs_dir_entry is allocated with the specified name and associated with the newly created inode. The new directory entry is then added to the current directory’s doubly-linked list of entries and the modification time is updated.
The pmemobjfs_new_dir function is presented in the following listing:
TX_BEGIN(objfs->pop) {
	inode = pmemobjfs_inode_alloc(objfs, flags, uid, gid, 0);
	pmemobjfs_inode_init_dir(objfs, inode);
	/* add . and .. to new directory */
	TOID(struct objfs_dir_entry) dot =
		pmemobjfs_dir_entry_alloc(objfs, ".", inode);
	TOID(struct objfs_dir_entry) dotdot =
		pmemobjfs_dir_entry_alloc(objfs, "..", parent);
	pmemobjfs_add_dir_entry(objfs, inode, dot);
	pmemobjfs_add_dir_entry(objfs, inode, dotdot);
} TX_ONABORT {
	inode = TOID_NULL(struct objfs_inode);
} TX_END
return inode;
First of all, the new inode is allocated with the specified permissions and ownership, and the directory-specific data of the inode is initialized. Next, the current and parent directory entries are allocated and added to the newly created directory. Everything is done within a transaction. In this case the transaction is nested, because this function is called from inside another transaction. According to the libpmemobj documentation, if the outer transaction aborts, all changes made within a nested transaction are rolled back as well, so we do not need to worry about the nested transaction committing before the outermost one.
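The nesting behaviour described above can be pictured with a toy undo log. This is only a model written for illustration, not libpmemobj code: all toy_* names are hypothetical, the log tracks plain int variables, and an abort is recorded as a flag instead of unwinding the call stack.

```c
#include <assert.h>
#include <stddef.h>

#define UNDO_MAX 16

struct undo_rec {
	int *addr; /* modified location */
	int old;   /* value before modification */
};

struct toy_tx {
	int depth;   /* nesting level, 0 means no transaction */
	int aborted; /* set when any level aborts */
	size_t n;    /* number of used undo records */
	struct undo_rec log[UNDO_MAX];
};

/* Begin a transaction; the outermost begin resets the shared log. */
void
toy_begin(struct toy_tx *tx)
{
	if (tx->depth++ == 0) {
		tx->n = 0;
		tx->aborted = 0;
	}
}

/* Snapshot a value before it is modified. */
void
toy_add(struct toy_tx *tx, int *addr)
{
	assert(tx->depth > 0 && tx->n < UNDO_MAX);
	tx->log[tx->n].addr = addr;
	tx->log[tx->n].old = *addr;
	tx->n++;
}

/* Mark the whole (flattened) transaction as aborted. */
void
toy_abort(struct toy_tx *tx)
{
	tx->aborted = 1;
}

/* End one level; only the outermost end commits or rolls back. */
void
toy_end(struct toy_tx *tx)
{
	assert(tx->depth > 0);
	if (--tx->depth > 0)
		return;
	if (tx->aborted) {
		while (tx->n > 0) {
			tx->n--;
			*tx->log[tx->n].addr = tx->log[tx->n].old;
		}
	}
	tx->n = 0;
}
```

Because the inner and outer levels share one log, a change made inside the nested level is still undone when the outer level aborts after the inner one has ended.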
The next interesting operation is allocating the file blocks. The following listing shows how it is implemented:
TX_BEGIN(objfs->pop) {
	/* allocate blocks from requested range */
	uint64_t b_off = offset / objfs->block_size;
	uint64_t e_off = (offset + size) / objfs->block_size;
	for (uint64_t off = b_off; off <= e_off; off += 1)
		pmemobjfs_file_get_block_for_write(objfs, inode, off);
	time_t t = time(NULL);
	/* update modification time */
	TX_ADD_FIELD(inode, mtime);
	D_RW(inode)->mtime = t;
	/* update status change time */
	TX_ADD_FIELD(inode, ctime);
	D_RW(inode)->ctime = t;
	/* update inode size; snapshot the field before modifying it */
	TX_ADD_FIELD(inode, size);
	D_RW(inode)->size = offset + size;
} TX_ONABORT {
	ret = (-ECANCELED);
} TX_END
The most important function is pmemobjfs_file_get_block_for_write, which either allocates a new block or returns a previously allocated block. In the latter case the previously allocated block is added to the transaction’s undo log in order to track all modifications of the file. The following listing shows the implementation of this function:
TOID(objfs_block_t) block =
	pmemobjfs_file_get_block(objfs, inode, offset);
if (TOID_IS_NULL(block)) {
	TX_BEGIN(objfs->pop) {
		block = TX_ALLOC(objfs_block_t,
				objfs->block_size);
		tree_map_insert(objfs->pop, D_RW(inode)->file.blocks,
				GET_KEY(offset), block.oid);
	} TX_ONABORT {
		block = TOID_NULL(objfs_block_t);
	} TX_END
} else {
	TX_ADD(block);
}
return block;
The pmemobjfs_file_get_block function returns the block at the given offset or returns OID_NULL if the block is missing.
The pmemobjfs_file_get_block_for_write and pmemobjfs_file_get_block functions are used in write and read operations respectively when operating on a file’s data.
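To make the range arithmetic in the block allocation listing concrete, here is a small standalone sketch of the same computation. The function name blocks_touched is hypothetical; it only reproduces the b_off and e_off calculation and counts the blocks visited by the inclusive loop.

```c
#include <assert.h>
#include <stdint.h>

/* Number of blocks visited by the inclusive loop from b_off to e_off. */
uint64_t
blocks_touched(uint64_t offset, uint64_t size, uint64_t block_size)
{
	assert(block_size > 0);
	uint64_t b_off = offset / block_size;          /* first block */
	uint64_t e_off = (offset + size) / block_size; /* last block */
	return e_off - b_off + 1;
}
```

With 448-byte blocks, a request at offset 1000 of size 500 covers blocks 2 and 3. A request that ends exactly on a block boundary, such as offset 0 and size 448, also visits the following block, because the loop includes e_off.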
The unlink operation utilizes two interesting mechanisms implemented in pmemobjfs. The first one is the inode’s reference counter, which is increased each time the given inode is referenced by another data structure. The inode is freed when the reference counter is equal to zero. The functions which operate on the inode’s reference counter are pmemobjfs_inode_get and pmemobjfs_inode_put.
The unlink operation is really simple:
TX_BEGIN(objfs->pop) {
	pmemobjfs_remove_dir_entry(objfs, inode, entry);
	TX_ADD_FIELD(inode, size);
	D_RW(inode)->size--;
} TX_ONABORT {
	ret = (-ECANCELED);
} TX_END
All the work is performed by the pmemobjfs_remove_dir_entry function:
TX_BEGIN(objfs->pop) {
	pmemobjfs_inode_put(objfs, D_RO(entry)->inode);
	PDLL_REMOVE(D_RW(inode)->dir.entries, entry, pdll);
	pmemobjfs_dir_entry_free(objfs, entry);
} TX_END
The reference counter is decreased, and the directory entry is removed from the doubly-linked list of the current directory and freed. The inode is freed if the reference counter becomes zero after calling the pmemobjfs_inode_put function.
In case of unlinking an opened file, the inode will not be freed immediately, because the open operation increases the inode’s reference counter and adds the inode to the tree map of opened inodes:
TX_BEGIN(objfs->pop) {
	/* insert inode to opened inodes map */
	tree_map_insert(objfs->pop, D_RW(super)->opened,
			inode.oid.off, inode.oid);
	/* hold inode */
	pmemobjfs_inode_get(objfs, inode);
} TX_ONABORT {
	ret = (-ECANCELED);
} TX_END
Using these two mechanisms it is really simple to implement the unlink operation with respect to opened files or directories, and to create hard links.
Please note that hard links are currently not implemented due to some problems with the FUSE kernel module which cause the appropriate callback function not to be called.
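The interplay between the reference counter and unlinking an opened file can be sketched with a minimal in-memory model. The names below are hypothetical stand-ins for pmemobjfs_inode_get and pmemobjfs_inode_put, and the sketch assumes that closing a file drops the reference taken by open, which is not shown in the listings above.

```c
#include <assert.h>
#include <stdint.h>

struct inode_model {
	uint32_t ref; /* reference counter */
	int freed;    /* set once the inode would be freed */
};

/* Take a reference, as the open operation or a new directory entry would. */
void
inode_get(struct inode_model *ino)
{
	assert(!ino->freed);
	ino->ref++;
}

/* Drop a reference; the inode is freed when the counter reaches zero. */
void
inode_put(struct inode_model *ino)
{
	assert(!ino->freed && ino->ref > 0);
	if (--ino->ref == 0)
		ino->freed = 1;
}
```

Starting from a counter of 1 (one directory entry), an open followed by an unlink leaves the counter at 1, so the inode survives until the last reference is dropped.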
pmemobjfs provides a feature for creating transactions. The current implementation is limited to a single transaction at a time for the whole file system, but this feature could be extended to multiple transactions for specified directories or files. The transaction is controlled via ioctl calls. For simplicity, three simple commands have been developed which do the required work: one to begin a transaction, pmemobjfs.tx_commit to commit it and pmemobjfs.tx_abort to abort it.
For the above commands the path to the pmemobjfs mount point or any other directory must be given. After the transaction begins, all modifications performed on files, directories or links of the file system are tracked by libpmemobj transactions, including all changes of attributes and data. They are made persistent after calling the pmemobjfs.tx_commit command. All changes are visible to the user immediately, but can be rolled back simply by calling the pmemobjfs.tx_abort command. The transaction can also be aborted implicitly if any exceptional situation occurs, for example an out-of-memory error when allocating a file block.
NOTE: Aborting the transaction while another process is still working on the file system may lead to undefined behavior. For example, if a new file was created within a transaction and the transaction is aborted while some other process is writing to the file, the behavior is undefined.
In this section I would like to present some performance test results obtained using the fio utility with the following configuration file:
The block size value has been chosen in order to minimize internal fragmentation on the pmemobjfs file system.
The tests were run on the Fedora 22 distribution, kernel version 4.2.0 with DAX support, on a pmem block device.
The tests were run on the following file systems: ext4 + dax, fusexmp_fh + ext4 + dax, pmemobjfs and pmemobjfs (NTB).
pmemobjfs (NTB) is a pmemobjfs version built without tracking of file blocks (PMEMOBJFS_TRACK_BLOCKS=0). fusexmp_fh is a file system which redirects all operations to the root file system. It is available in the FUSE examples.
The results are presented in the following table:
FS | READ BW [KB/s] | WRITE BW [KB/s] |
---|---|---|
ext4 + dax | 232030 | 231333 |
fusexmp_fh + ext4 + dax | 28687 | 28602 |
pmemobjfs | 29120 | 29034 |
pmemobjfs (NTB) | 30112 | 30023 |
The results show considerable overhead from FUSE itself, but they also show that pmemobjfs performs slightly better than the fusexmp_fh example file system, which is good news for us :).
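To put numbers on that comparison, the following sketch computes relative read bandwidth from the figures in the table. The struct and function names are made up for this example; only the values are taken from the results.

```c
#include <assert.h>
#include <stddef.h>

/* Figures copied from the results table, in KB/s. */
struct fs_result {
	const char *fs;
	double read_kbs;
	double write_kbs;
};

const struct fs_result results[] = {
	{ "ext4 + dax", 232030, 231333 },
	{ "fusexmp_fh + ext4 + dax", 28687, 28602 },
	{ "pmemobjfs", 29120, 29034 },
	{ "pmemobjfs (NTB)", 30112, 30023 },
};

#define N_RESULTS (sizeof(results) / sizeof(results[0]))

/* Read bandwidth of entry a relative to entry b. */
double
read_ratio(size_t a, size_t b)
{
	assert(a < N_RESULTS && b < N_RESULTS);
	return results[a].read_kbs / results[b].read_kbs;
}
```

Here read_ratio(2, 1) comes out at about 1.015, so pmemobjfs reads roughly 1.5% faster than fusexmp_fh, while read_ratio(0, 2) is close to 8, which is the gap to ext4 with DAX.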
The pmemobjfs example shows how the libpmemobj API works in a real application. It can be used to run some performance tests using well known file system test suites. If you have any questions or ideas for improvement of the pmemobjfs please feel free to join a discussion on our Google Group.