Skip to content

Implementation Details

Jung-Sang Ahn edited this page Jul 22, 2014 · 7 revisions

Write-Ahead Logging

For maximizing the overall disk throughput, ForestDB writes all incoming mutations to a database file in a log-structured append-only manner. However, the index structures (i.e., ID index and Seq index) are not immediately updated in order to reduce the overall write amplification, and the batch of these pended mutations is eventually applied to the index structures later. For this, ForestDB maintains in-memory WAL ID index and WAL Seq index for indexing WAL documents that are not indexed by the main index structures. In this way, those WAL documents can be efficiently retrieved by looking up in-memory WAL indexes.

WAL in-memory structure

As shown in the above figure, these in-memory WAL indexes are organized as hash table that maps from the hash value of a key (ID) or its sequence number to the offset of a database file where (metadata, value) entry is located. Since these index structures reside only in-memory, there is no further disk I/O overhead when we lookup WAL entries.

WAL disk layout

There is a threshold on the total size of WAL entries, and we check the WAL size for every commit operation. The batch of main index updates for WAL entries are aggregated and reflected in both ID index and Seq index if the total size of WAL entries is larger than the configured threshold. Updated nodes in the main index structures are sequentially appended at the end of a database file, and all entries in in-memory WAL indexes are removed. The upcoming mutations from an application will be appended at the end of a database file and added to the in-memory WAL indexes again.

File Layout

ForestDB supports an optional block align in writing a document (i.e., key's metadata and value) in a database file. If a block align is enabled, then when a document is written in a database file, ForestDB first checks the remaining space of the current block to avoid a document being written over several blocks. The document is simply appended if the remaining space is larger than the size of document. If not, the document is written at the first byte of next block skipping the space of the current block. When the size of a document is larger than a block size, we compare the total number of blocks to be occupied by this document when the document is written at the end of the current position with when the document is written at the beginning of the next block. Then, we choose the location that occupies less blocks. The following figure shows a file layout when the block align option is enabled. Note that the block align option for a document is disabled by default.

Block aligning for a document

Unlike documents, each node in B+ tree is exactly fitted into a block, and they are always written at the beginning of a new block to avoid interleaving with documents in the same block. Since ForestDB does not allow in-place updates, a modified node is written at the end of file invalidating the old node. If an update occurs on a leaf node, its parent node should be modified and written in a new block because the location of the leaf node is moved. For the same reason, all the nodes along the path from the updated leaf node to the root node should be updated and appended to the end of the file recursively. After writing all index updates, we finally write a database header block at the end of the file.

File layout for B+ Tree

Buffer Cache

ForestDB supports a separate global buffer cache for keeping frequently accessed data (e.g. index nodes of B+-tree) in memory, and also provides an option for bypassing the OS page cache through using the direct I/O flag (e.g., O_DIRECT).

Buffer Cache Block Diagram

This global buffer cache is a shared resource among writers and readers, and consists of a global LRU lists for opened files. We maintain a separate AVL tree for each database file to keep track of the list of dirty blocks belonged to that file. All dirty blocks are sequentially written back to the corresponding file in an ascending order of their offset values by traversing the AVL tree, which allows us to maximize the write throughput. Each file maintains a hash table with a key {hash(block_id)} and a value {a pointer to a cache entry in either the clean or dirty list} for a fast lookup in the global buffer cache.

Buffer Cache LRU

Recovery

ForestDB detects a database file corruption by reading the last block of the file. If the last block is B+ Tree index node, document or incorrect database header, then the file is corrupted. In this case, ForestDB scans each block reversely to find the last valid database header written after index updates, and then reconstructs WAL entries normally committed until the last valid header in the file.

Recovery from crash

ForestDB also maintains a 4-byte checksum value (using CRC32 by default) for each B+-Tree index node, document, and database header to detect a corruption in a database file. 1-byte block marker is added to the end of each block (4KB by default) to identify a block type (i.e., index node, document, or database header) as shown in the following figure.

Checksum based on CRC32

Clone this wiki locally