Implementing my own Time Series Database: Core Structure (Part 1)
I'm currently working a lot with time series data and databases. That sparked an interest in looking under the hood and understand database systems more deeply. Fueled by "how hard can this actually be"-ignorance I've started to implement my own time series database. Because learning one new thing wasn't enough, I also learned zig in parallel.
I've started the journey by binge-watching the CMU Intro to Database Systems lectures. Andy Pavlo is an amazing lecturer, and I can only recommend this series to anyone wanting to understand DB systems or being nostalgic about their time at university.
My goal was to build a DB that could:
- store time series identified by an ID
- query and transform these series using a simple query language
- be accessible over the network
Core Structure
I've started my journey by figuring out to work with pages. This finally resulted in the following overall architecture for the core of my database.
A page directory serves as the central component handling pages in memory. Any other part of the system will retrieve and interact with pages through this directory. The directory itself is only responsible for pages in memory. Whenever it needs to persist or read pages from disk, it will use the File Manager.
The file manager is the single structure interfacing with the file system. Its only responsibility is to persist pages to disk and retrieve them again later on. Because to the layout of files I've chosen it knows some details about the database implementation, but could be refactored to be completely agnostic of the database using it.
The main user of the page directory is the B+Tree implementation. Series in the database system are represented by B+Trees. Other systems will instantiate B+Tree instances for each series they interact with. The tree structure loads necessary data from the series pages via the page directory.
Pages
Pages are the core entities of database systems. As databases have to deal with (much) more data than will fit into memory data has to be read and written to disk frequently. Therefore, all persistent data is structured into pages. In essence, pages are just data blocks of a specified size. These can be moved between disk and memory without transformations.
In my system, the start of each page is used as a page header. The header specifies what kind of data is stored in the page. Additionally, it has a version number for future use and a CRC hash to detect data corruption. The body of a page is just a block of data, which structure depends on its type. The interpretation of the body is controlled by the user of the page.
As pages are transferred between disk and memory, it is important to have exact control over their layout. Regular struct have no guarantees on the layout of fields in memory and have additional padding bytes between fields for alignment. Therefore, I used packed struct for all page definitions.
The size of a page is controlled by a comptime variable PAGE_SIZE. Thanks to zig's comptime feature, all related sizes and offsets can be derived from this and compile time. I can also derive capacity of the B+Tree branches and leaves (see below) from this. Therefore, I can simply adjust the page size, and all other parts will adjust accordingly.
/// total size of pages in bytes
pub const PAGE_SIZE: usize = 4 * 1024;
/// size of page header in bytes
pub const PAGE_HEADER_SIZE: usize = @bitSizeOf(PageHeader) / 8;
/// size of page data section in byters
pub const PAGE_DATA_SIZE: usize = PAGE_SIZE - PAGE_HEADER_SIZE;
One limitation I ran into is that zig does not (yet) allow arrays in packed structs . Because of this limitation, I had to define an integer type of the correct size to specify the body of the page, instead of simply using [BODY_SIZE]u8. Another alternative would have been @Vector(BODY_SIZE, u8) (as found in the issue), but this felt more wrong.
pub const PageHeader = packed struct {
    type: PAGE_TYPE,
    version: u8,
    crc: u32 = 0,
};
pub const PageData = @Type(.{ 
	.int = .{
		.signedness = .unsigned,
		.bits = PAGE_DATA_SIZE * 8
	}
});
pub const Page = packed struct {
    header: PageHeader,
    // data is only used as a blob of data, and not directly used
    data: PageData,
}
File Structure
After implementing my page structure, I had to think about how to organize them on disk. I've decided to store each series created in the database as an individual file. Pages within a file/series are identified by there position within the file. Thereby, it was easy to get any page for a particular series by the file name. An additional hope was that this would keep the related data close in the filesystem, improving performance, but I never tested this hypothesis.
This structure means that pages have two IDs; a local and a global ID. The local ID is the page's position within a series file. The global ID is the concatenation of series ID and the local page ID. The first page (local ID 0) of each file is a series header page. This page keeps information about the series. At the moment it only stores the ID of the root pages for the B+Tree.
Page Directory
The page directory is responsible for holding and managing pages in memory. On initialization, it allocates a fixed number of slots which can hold pages. Thereby, no memory has to be allocated when a new page is loaded from disk and the footprint of the directory is static. For each slot a PageHandle handle is created. These handles track usage of the slot; which pages is currently loaded, whether it has been modified or if it is currently in use.
The interface for the using systems is rather simple. Pages can be requested by there ID, either in read or write mode. The page directory transparently loads them from disk, through the file manager, and provides the corresponding page handle to the caller. Once the page is no longer needed the page handle is "returned" to the directory.
Eventually, during the runtime of the system, enough pages will have been loaded that all pages are occupied. At this point, a currently unused page has to be evicted. Each page handle has a used flag, which is set whenever anyone uses the page. When the directory looks for a free slot it iterates over the slots and skips any slot with a raised used flag and unsets it. Once it encounters a slot with an unset flag which is currently not locked the pages is replaced. This approach approximates "least recently used", but is much faster than a full implementation.
Modified pages are not directly saved to disk when they are freed by a writing user. Instead they are kept dirty in the directory and only written to disk when the page has to be evicted. To prevent data loss, I've added a background worker with cycles through all slots to persist dirty pages if they are currently not in use. (A lesson learnt after losing a few weeks of data which was never persists.)
B+Tree
I chose to represent series as B+Trees, a.k.a. the best data structure in computer science. In contrast to other time series databases, which often use LSM trees I chose to use B+Tree as it keeps the data ordered at every point in time. This makes traversal of the series data simple and (hopefully) fast.
B+Trees consist of branch and leaf nodes. Branch nodes keep a list of references to their children. Between two child references the branch keeps a threshold values. The referenced child between two threshold values will only hold values between these values. The rules by which the tree is transformed when values are added or removed keep it balanced, where a leaf nodes are at the same depth. This allows lookup of values within a few steps through the tree.
In my database the leaf nodes store tuples of (timestamp, value) in order. Additionally they hold a reference to their next sibling. This allows fast traversal of the leaf nodes when scanning over a time period.
Next Part: Query Language
