Designing a New Solana Accounts Database (part 1)

Author: Liam Heeger

Published: Jan 22, 2023

Note 1: This article assumes some knowledge of blockchains and related mechanisms.

Note 2: This is a living article and most certainly not complete. Feel free to reach out via the provided contact info if there are any inaccuracies.

This is part one in what will likely become a series on the Solana accounts database. Here we explore some foundational concepts and touch on what a new design for a Solana accounts database may look like.

A Quick Introduction


Solana is a unique blockchain. As one of the first viable proof-of-stake blockchains, it has aimed to be highly performant, reliable, and decentralized. Its detractors would have you know that it has not met these aims, but that is why we are here!

The accounts database is one area that is critical to the bottom-line performance of the Solana validator. This post will touch on the Solana validator’s accounts database, why we need a new one, and what a new design needs to look like. There are still open questions, and I will briefly lay those out at the end. But first I should explain what in the world the accounts database is!

What is the Accounts Database?


The accounts database is a component which, in some form or another, can be found in every blockchain node. Generally, this database is a view of the current state of the chain. I will liken it to a very large cache: in theory, you can compute the state of the blockchain by replaying all of the transactions since genesis and receive the exact same result. That is an extremely expensive endeavour, though, so we have the state database (Solana calls it the accounts database; I will use the terms interchangeably) as a kind of “roll-up” of all of the transactions. In all blockchains, the state database is what validators or miners build new blocks on top of. The blockchain node’s execution runtime (think: virtual machine) uses the state database as its persistent storage.
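
To make the “replay from genesis” idea concrete, here is a toy Rust sketch. It assumes, purely for illustration, that every transaction is a simple balance transfer; all of the names here are mine, not Solana’s:

    use std::collections::HashMap;

    // Toy model: addresses map to balances, and every transaction is a
    // simple transfer. Real transactions are far richer than this.
    type Address = [u8; 32];

    struct Transfer {
        from: Address,
        to: Address,
        lamports: u64,
    }

    // Replaying every transaction since genesis reproduces the exact state
    // the database holds, which is why the database is, in effect, a giant
    // cache of this fold.
    fn replay(genesis: HashMap<Address, u64>, txs: &[Transfer]) -> HashMap<Address, u64> {
        let mut state = genesis;
        for tx in txs {
            let from = state.entry(tx.from).or_insert(0);
            *from = from.saturating_sub(tx.lamports);
            *state.entry(tx.to).or_insert(0) += tx.lamports;
        }
        state
    }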

The state database can contain many different kinds of information, but it usually stores items by “account”. An account is like a page of information with a unique identifier, the “address”. An account can be an externally owned account (EOA), or it can be some other record generated by on-chain code (a smart contract). In the simple case of Bitcoin, all accounts are EOAs and the data in the accounts database is just each account’s current balance.

What is special about the Solana accounts database? Here is a brief list of some of the design considerations:

  • Accounts can be marked as read-only and/or executable. This is unlike Ethereum and EVM-compatible chains, where executable code and data are stored separately and code segments are immutable.
  • Accounts have an owner which can be set. The owner is allowed to modify the state of an account.
  • The size of an account can grow after inception. It may grow by at most 10 KiB per instruction, up to a maximum size of 10 MiB per account. The total of all account data size changes within a transaction is capped at 20 MiB.
  • Each account has a balance of SOL.
  • A periodic “rent” is collected from all[1] accounts for the on-chain storage they consume.
  • There is some other metadata, but this is unimportant for our discussion.
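
Taken together, a simplified account record might look like the following Rust sketch. The field names are mine and the real validator’s layout differs, but these are the fields discussed above:

    // A simplified sketch of one account record; the real validator's
    // representation differs, but these are the fields discussed above.
    type Pubkey = [u8; 32];

    struct Account {
        lamports: u64,    // SOL balance, in lamports (1 SOL = 10^9 lamports)
        data: Vec<u8>,    // account data, up to 10 MiB
        owner: Pubkey,    // the program permitted to modify this account
        executable: bool, // whether this account holds executable code
        rent_epoch: u64,  // the next epoch at which rent is due
    }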

Accounts Database Implementations


Typically, the state database for a blockchain is stored as a key-value store mapping account address to account data and metadata. Several off-the-shelf key-value stores are used in the wild for this purpose, namely LevelDB, RocksDB, and MDBX (in the case of Erigon), alongside some custom implementations (Solana’s AccountsDB).

LevelDB and RocksDB are log-structured merge-tree (LSM) style KV stores; by most if not all metrics, RocksDB is the better implementation for most use cases. Solana at some point used RocksDB for its accounts database. There are many reasons why these databases are not suitable for Solana’s accounts database, namely that:

  • Concurrent writes under RocksDB are not fast enough for what Solana has to do. In theory, about 64 threads could be writing to an assortment of uncontended records in the database at once.
  • RocksDB and LevelDB assume you want to do range scans, so records are kept ordered. This is not needed for the Solana runtime’s access pattern, and disregarding the order of records in the database could have runtime benefits.
  • I would assume that the write-ahead logging (WAL) system in RocksDB is glacially slow and potentially unnecessary.
  • It is unclear whether RocksDB is intelligent about small changes to large records. This is a common operation and would need to be addressed.
  • RocksDB’s in-memory cache is likely not granular enough to be efficient.
  • Compaction, a necessary process for LSM-style KV stores, is costly and time-consuming.

And this is why Solana moved to their own custom implementation. I have yet to pore over it completely, as it is several thousand lines of code across a dozen files. Their custom implementation is supposedly faster on concurrent writes and reads.

Design of a New Accounts Database


Now let us get into what a new Solana accounts database might look like. Of course, before we do any of this, we will need to go out and measure real-world access patterns to validate the following hypotheses.

Some of the goals of such a high performance accounts database include:

  • As few reads and writes to disk as possible (i.e. low read and write amplification).
  • Efficient modification of account data.
  • Highly concurrent access without sacrificing performance.
  • Reliability, availability, and correctness.
  • Fast rollback.

Given these goals, we can start to postulate a possible design. We can assume that we have a very high-spec machine with lots of threads, lots of memory, and an excess of SSD storage. That being said, the accounts database is much larger than main memory, so we will have to store it on disk. This makes us very much I/O constrained, which in turn means we need a really good cache tailored to the use case.

The new design would look as follows:

  • Disk storage format: We want to lay out a large hash table structure on disk. This will likely result in some set of files organized to contain data records. Each file will have a header containing a hash table of the records in the file, along with a bloom filter for quick existence checks (a sketch of this layout follows the list). It is unclear yet, but it may be promising to organize these files into a prefix trie, such that each address describes a path down the trie to the file(s) holding the record. If the majority of this trie can be cached in memory, it is likely that we can make this really fast.
  • In-memory cache: A very efficient hash table! I believe that we will need a cache granularity much smaller than the account level; we can likely get away with caching only requested (account, 256-byte block of data) pairs, because when data in these accounts is modified or read, it is likely a much smaller portion than the whole. This will increase cache hit rates, as more space will be available to cache more relevant data. We will likely use an LRU cache (more precisely, a CLOCK cache) with some preference for records which are used very frequently (see the cache sketch below).
  • Diff write-back log: Every time a record modification is committed, we will take the 4 KiB page it resides on and add it to a buffer for subsequent writing. A hash table of the pages in the buffer will be maintained so that further edits can be made to the same page while it is still in the queue (see the write-back sketch below). This will avoid re-writing a whole 10 MiB account just because a few bytes changed within it.
  • Compaction: We want a system which avoids compaction at all costs, but not if the alternative is inefficient.
  • Concurrency control: Due to the way the Solana runtime operates, we can likely use a lock-free transaction scheme when executing a block’s transactions, because we know beforehand that the transactions we wish to execute will not contend (a conflict check is sketched below). This may be a bad assumption.
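
To ground the disk storage format idea, here is a rough Rust sketch of what one data file’s layout could look like. Every name, field, and constant here is an assumption, not a settled design:

    // Hypothetical layout for one on-disk data file: a fixed header, then
    // a bloom filter, then a hash table of record locations, then the
    // records themselves.
    #[repr(C)]
    struct FileHeader {
        magic: u64,          // identifies the file format and version
        record_count: u64,   // number of records stored in this file
        bloom_offset: u64,   // byte offset of the bloom filter bits
        bloom_len: u64,      // bloom filter length, in bytes
        table_offset: u64,   // byte offset of the record hash table
        table_slots: u64,    // number of slots in the hash table
        records_offset: u64, // byte offset where record data begins
    }

    // One hash table slot: maps an address to its record within the file.
    #[repr(C)]
    struct TableSlot {
        address: [u8; 32],
        record_offset: u64,
        record_len: u64,
    }

    // Bloom filter probe: if any derived bit is zero, the record is
    // definitely not in this file; otherwise it may be.
    fn bloom_may_contain(bits: &[u8], address: &[u8; 32], num_hashes: u32) -> bool {
        (0..num_hashes).all(|i| {
            // FNV-1a-style fold of the address, seeded differently per probe.
            let h = address.iter().fold(0xcbf29ce484222325u64 ^ i as u64, |acc, b| {
                (acc ^ *b as u64).wrapping_mul(0x100000001b3)
            });
            let bit = (h % (bits.len() as u64 * 8)) as usize;
            bits[bit / 8] >> (bit % 8) & 1 == 1
        })
    }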
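
For the in-memory cache, the key idea is sub-account granularity. A minimal sketch, assuming the 256-byte block size floated above; the eviction sweep itself is omitted (CLOCK approximates LRU by evicting the first entry whose referenced bit the sweeping hand finds cleared):

    use std::collections::HashMap;

    // Hypothetical cache key: an account address plus the index of a
    // 256-byte block within its data, so hot fragments of large accounts
    // can be cached without pulling in the whole account.
    #[derive(PartialEq, Eq, Hash, Clone, Copy)]
    struct CacheKey {
        address: [u8; 32],
        block_index: u32, // byte offset within the account / 256
    }

    struct CacheEntry {
        block: [u8; 256],
        referenced: bool, // CLOCK bit: set on access, cleared by the hand
    }

    struct BlockCache {
        entries: HashMap<CacheKey, CacheEntry>,
    }

    impl BlockCache {
        fn get(&mut self, key: &CacheKey) -> Option<&[u8; 256]> {
            match self.entries.get_mut(key) {
                Some(entry) => {
                    // Give the entry a second chance before eviction.
                    entry.referenced = true;
                    Some(&entry.block)
                }
                None => None,
            }
        }
    }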
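
The diff write-back log could stage dirty 4 KiB pages in memory and coalesce repeated edits to the same page before flushing. Another sketch under assumed names; a real implementation would fault each page in from disk rather than start it zeroed:

    use std::collections::HashMap;

    const PAGE_SIZE: usize = 4096;

    // Hypothetical write-back buffer: committed modifications are staged
    // as dirty 4 KiB pages, and later edits to the same page coalesce in
    // place instead of re-writing the whole account.
    struct WriteBackLog {
        // (account address, page index within the account) -> page bytes
        dirty: HashMap<([u8; 32], u32), [u8; PAGE_SIZE]>,
    }

    impl WriteBackLog {
        // Stage `bytes` at `offset` within the account's data. Pages
        // already queued are edited in place; new pages start zeroed here
        // purely to keep the sketch self-contained.
        fn stage(&mut self, address: [u8; 32], offset: usize, bytes: &[u8]) {
            let mut pos = offset;
            for b in bytes {
                let page = self
                    .dirty
                    .entry((address, (pos / PAGE_SIZE) as u32))
                    .or_insert_with(|| [0u8; PAGE_SIZE]);
                page[pos % PAGE_SIZE] = *b;
                pos += 1;
            }
        }
    }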
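
Finally, on concurrency control: Solana transactions declare up front which accounts they read and which they write, which is what makes contention checkable before execution. A minimal conflict test might look like this (the real runtime’s scheduling is more involved than a pairwise check):

    use std::collections::HashSet;

    // The accounts a transaction declares it will touch.
    struct TxAccounts {
        writes: HashSet<[u8; 32]>,
        reads: HashSet<[u8; 32]>,
    }

    // Two transactions contend if either one writes an account the other
    // touches; otherwise they can safely execute in parallel.
    fn conflict(a: &TxAccounts, b: &TxAccounts) -> bool {
        a.writes.iter().any(|k| b.writes.contains(k) || b.reads.contains(k))
            || b.writes.iter().any(|k| a.reads.contains(k))
    }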

This is very much a high-level sketch, and I imagine it will change as we approach actual implementation. I am very much still warming up to Solana, so this could all be wrong.

Some Further Considerations


These are mostly just notes and questions that I still have:

  • Do we need locks? What locking granularity?
  • How do we fill in the database on startup? On restart? On fault? We may need a WAL (yikes!) or a relaxed WAL algorithm.
  • How do we handle the scans and mass updates required for rent payment?
  • How do we handle resizing of records?

Next Steps


Well, there is a lot here to do and even more to continue thinking on! But before we can do any implementing, there are several things that need to be quantified.

I think we need to ascertain the access pattern for accounts as it exists in the real world. By measuring the read and write patterns, we can determine the right in-memory caching scheme as well as the right layout on disk.

We also need to measure the distribution of accounts by size, data entropy, and diversity of content. If many accounts are similar and do not change, we can optimize for this.

I have already done some benchmarking of RocksDB, but would like to do it with an analogue of Solana access patterns (and with optimal settings). I would like to do the same with the Solana AccountsDB to obtain a baseline for performance.

I am hoping this will be the content for part 2! Thanks for reading!

Footnotes


  1. Technically, an account can be rent-exempt if it holds a certain amount of SOL.