SQLite : Pages & Btrees

1INSERT INTO sandwiches (name, length, count) VALUES ('Italian', 7.5, 2)

Our row of sandwich data exist as an array of bytes inside SQLite that's encoded using its record format. For our inserted row, we see the following bytes on disk :

15 01 05 00 1b 07 01 49 74 61 6c 69 61 6e 40 1e 00 00 00 00 00 00 02

Byte Chunks	Description	Encoding
15	This row's payload. In bytes.	Varint
01	Row ID, primary key.	Varint
05 00	Overflow pages configuration
1b	TEXT type , 7 bytes.	TEXT-BLOB type
07	Floating point type	Floating point type
01	Integer that can fit in 8 bit.	Integer type
49 74 61 6c 69 61 6e	"Italian"	UTF-8
40 1e 00 00 00 00 00 00	7.5	IEEE-754-2008 floating-point
02	2	Integer

Varint Encoding #

These first two fields use what's called a variable-length integer encoding (varint). This encoding is used so that we don't use a huge 8-byte field for every integer and waste a bunch of space. Varints use a simple trick. The high bit is used as a flag to indicate if there are more bytes to be read and the other 7 bits are our actual data. So, if we wanted to represent 1000, we start with its binary representation into 7 bit chunks :

0000111 1101000

Then we add a "1" flag bit to the beginning of the first chunk to indicate we have more chunks, and a "0" flag bit to the beginning of the second chunk to indicate we don't have any more chunks.

10000111 01101000

With varints, we can now store our integer in 2 bytes instead of 8.

TEXT - BLOB Type Encoding #

For our name column, the 0x1b value specifies that it is a TEXT type and has a length of 7 bytes. Type values that are odd and are greater or equal to 13 are TEXT fields and can be calculated with the formula (n * 2) + 13. So, our 7 byte string is (7 * 2) + 13 which is 27, or 0x1b in hex. BLOB fields are similar, except they're even numbers calculated as (n * 2) + 12.

Floating Point Type Encoding #

Next, we have our "length" field which is a floating-point number. These are always encoded as a 0x07.

Integer Type Encoding #

After that, we have our "count" field which is an integer. Integers that can fit in an 8-bit integer are represented with a type value of 0x01. 16-bit integers are 0x02, 24-bit integers are 0x03 and so on.

Surprisingly though, adding or updating rows is still nearly as instantaneous as when I had a single row. So how does SQLite update a big database in a fraction of a second? The answer is pages and b-trees.

Pages #

A naive approach to a database would be just to pack records in sequentially in a file. However, there's no way to insert or update rows in the middle of the file without shifting and rewriting all the bytes after the new row.

Instead, SQLite groups rows together into 4KB chunks called "pages". Why 4KB? That's what file systems typically use as their page size. So keeping everything aligned reduces page fetches from disk. Disks are usually the slowest part of a database, so limiting page fetches can have a huge performance win.

If we inspect our page, we can see its header :

0d xx xx 00 03 xx xx xx

There are several parts of this header but I masked out several bytes so we can focus on two particularly important fields. The first byte, 0x0d, indicates the page type. This page type is a table leaf. We'll talk about those more with b-trees.

The second important field is the cell count, which is 0x0003. This tells us that 3 records exist in this page.

After the header, we have the cell pointer index :

0f e9 0f d2 0f bb

This is a list of 2-byte values that represent offsets in the page for each records. The first record exists at offset 4,073 (0x0fe9), the second record exists at offset 4,050 (0x0fd2), etc. SQLite packs rows at the end of the page and then works backwards.

After the index, we have a bunch of zeros until we reach the content are of the page, which holds our row data in record format.

B-Trees #

Now that we've chunked our data into pages, we can update a subset of our data without having to rewrite the whole file. That's great for writing.

But we have a list of pages to search through if we want to query of our data. The simplest approach would be to start at the first page, and then search every page for a given row. This is called a "table scan" and it can be really slow, especially if your data is at the end of your table.

If you are searching by your primary key, you could perform a binary search since the table is already sorted by primary key. For a binary search, you search the page right in the middle of your data record. If the record exist, great! You're done. But if the data you're searching for is before that page, then you find a new "middle" page in the first half of your table. Keep slicing the search space in half and searching until you find the data. A binary search has a logarithmic time complexity, or O(log n).

If we have a small 2MB database with 4KB pages then that means we have 512 pages. A binary search of that many pages would have worst-case scenario log2(512) = 9 pages. That means we might have to read nine pages in our database just to find one record. We still want to reduce that as much as possible.

SQLite is structured as a b-tree, which is a data structure where each node can point to two or node child nodes and the values in these child nodes are all sorted. There are a ton of different variants of b-trees but the one that SQLite uses is called b+tree. This type of tree stores all data in the leaf node, but also have an index of key ranges in the branch pages called "interior pages".

Leaf Page #

Let's say our leaf pages hold data records that are each 40 bytes. That means we can fit about 100 records into one 4KB page. If we have less than 100 records, we only need one page.

Interior Page #

But what happens when we add a 101st data and it doesn't fit anymore? SQLite will split the page into two leaf pages and add an interior page as the root of our b+tree that points to our child leaf pages. This interior page stores the key ranges for the leaf pages so that when you search, you can see what ranges each child page holds without having to actually read that child page.

Interior pages are also 4KB and they store pairs of child primary keys and their page numbers, so each entry in our interior page takes about 8 bytes. That means we can store about 500 entries in an interior page.

For our 2MB database, that means we can hold the key range for all of our leaf pages in a single interior page. To find a given record, we only need to search the root interior page to find the correct leaf page. Our worst case search is only 2 pages.

Interior Page's Next Split #

But what happens when our root interior page fills up and we need a database bigger than 2MB? We split the interior page in two and add a new root parent interior page that points to the split interior pages. We no have a tree depth of 3.

Since our new root can hold about 500 references to second level interior pages, and those second level pages hold about 500 references to leaf pages, we can store about 250,000 pages, or about 1GB of data. If we continue adding rows, we'll need to split the root page again. At a tree depth of 4, we can hold about 500^3 leaf pages, or about a 476GB database. That's mean we only need to read 4 pages to find a given record, even in a huge database!

Reference #

Ben Johnson (July 27, 2022) SQLite Internals : Pages & B-trees Fly.io Blog