
Solana AccountsDB


A Deep-dive into Solana's AccountsDB

Written By

Sig Engineering

August 12, 2024

What is AccountsDB?

In Solana, all data is stored in what are referred to as "accounts". The data is organized much like a key-value store, where each entry in the database is an account. AccountsDB is the database of accounts stored by validators and used to track account data; as such, it is a critical component of Solana.

Snapshots

When a new validator starts, it must first catch up to the current state of the chain; in Solana, that is accomplished through Snapshots. Snapshots contain the full state of the blockchain (including all accounts) at a specific slot. They are requested/downloaded from existing validators in the network and are used to bootstrap new validators (instead of starting from Genesis).

A snapshot contains two main components:

  • A metadata file, which contains information about the latest block and the state of the database

  • An accounts folder which contains files that hold the account data.

Below is a diagram of a standard Solana snapshot at slot 196493007, with the two components highlighted. 

Note: Snapshots are built and downloaded as a tar archive compressed with the Zstandard format, so the file extension is usually .tar.zst. The first step whenever loading from a snapshot is to uncompress and unarchive it into the directory layout as shown above.

Account Files

One of the main components of a snapshot is the set of account files, which contain all the account data for a specific slot. Each file is organized as a contiguous list of accounts serialized as bytes.

Note: in the Rust implementation, the files that contain the accounts are referred to as AppendVecs. This name, however, is outdated since appending is no longer used in the codebase, so we will refer to them as account files in this post.

Reading this data involves:

  1. Decompressing and unarchiving the snapshot into separate account files

  2. Individually loading each account file into memory (via the mmap syscall)

  3. Reading the entire length of each file and organizing it into structures that can be used by the AccountsDB code

The account file format appears as:

Note: Since an account’s data is a variable length array, we first need to read the data_len variable, so we know how many bytes to read into the data field. In the diagram, the Account Header stores useful metadata about the account (including data_len) and the Account Data contains the actual data bytes.

Pseudocode which parses all the accounts from a file is included below:
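(A simplified Rust sketch: the header here only includes a pubkey, lamports, and data_len, with entries assumed to be 8-byte aligned; the real header contains more fields.)

```rust
// Sketch: walk an account file buffer front to back, reading a header and
// then `data_len` bytes of account data for each entry. Header fields and
// sizes here are illustrative, not the exact on-disk layout.

struct AccountHeader {
    pubkey: [u8; 32],
    lamports: u64,
    data_len: u64,
}

struct ParsedAccount<'a> {
    header: AccountHeader,
    data: &'a [u8],
}

fn read_u64(buf: &[u8], offset: usize) -> u64 {
    u64::from_le_bytes(buf[offset..offset + 8].try_into().unwrap())
}

/// `file_len` comes from the snapshot metadata: the mmap'd file may contain
/// extra bytes past this point that do not belong to the snapshot's slot.
fn parse_account_file(buf: &[u8], file_len: usize) -> Vec<ParsedAccount<'_>> {
    const HEADER_SIZE: usize = 48; // 32 (pubkey) + 8 (lamports) + 8 (data_len)
    let mut accounts = Vec::new();
    let mut offset = 0;

    while offset + HEADER_SIZE <= file_len {
        let mut pubkey = [0u8; 32];
        pubkey.copy_from_slice(&buf[offset..offset + 32]);
        let lamports = read_u64(buf, offset + 32);
        let data_len = read_u64(buf, offset + 40);

        // The variable-length data follows the header.
        let data_start = offset + HEADER_SIZE;
        let data_end = data_start + data_len as usize;
        if data_end > file_len {
            break;
        }

        accounts.push(ParsedAccount {
            header: AccountHeader { pubkey, lamports, data_len },
            data: &buf[data_start..data_end],
        });

        // Assume entries are aligned to 8 bytes; round up to the next entry.
        offset = (data_end + 7) & !7;
    }
    accounts
}
```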

The Account Index

Now that we understand how the account files are organized to store account data, we also need to organize this data by the account’s public key, i.e., we need to build a mapping from a given pubkey to the location of the corresponding account in the file. This mapping is also referred to as the Account Index.

More specifically, we need a mapping from a pubkey to a tuple of (file_id, offset) where file_id is the name of an accounts file to read, and offset is the index of the account bytes in the file. With this mapping we can access the associated account of a pubkey by opening the file associated with the file_id, and reading the bytes starting at offset.

An account can also change across slots (e.g., a snapshot might include multiple versions of the same account from different slots), so we associate each account's pubkey with a collection (or "Vec") of file locations, one for each version. Since, in Agave, there is only ever a single account file per slot, we can structure the Account Index as Map<Pubkey, Vec<(file_id, offset)>>, which maps a pubkey to a collection of account references.

To read the latest state of an account, we need to find the reference with the highest slot, so we also track the slot associated with each file_id.
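For illustration, a minimal RAM-only version of this mapping could look like the following sketch, where the slot is tracked alongside each file_id so the newest reference can be selected (types and names here are simplified stand-ins):

```rust
use std::collections::HashMap;

type Pubkey = [u8; 32];
type Slot = u64;
type FileId = u32;
type Offset = usize;

/// One entry per version of the account (at most one per slot, since each
/// slot has a single account file).
#[derive(Clone, Copy)]
struct AccountRef {
    slot: Slot,
    file_id: FileId,
    offset: Offset,
}

/// A simplified, purely RAM-based account index.
type AccountIndex = HashMap<Pubkey, Vec<AccountRef>>;

/// Return the reference to the latest (highest-slot) version of an account.
fn latest_ref(index: &AccountIndex, pubkey: &Pubkey) -> Option<AccountRef> {
    index.get(pubkey)?.iter().max_by_key(|r| r.slot).copied()
}
```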

Because there are so many accounts on Solana, the index can become very large which can lead to large RAM requirements. To reduce these requirements, Solana utilizes a disk-based hashmap. 

Account references are by default stored on disk. When a pubkey is accessed and read from disk, its account reference is cached in the RAM-based hashmap for faster accesses later.

When reading accounts for a given pubkey, the general flow will be: 

  • Check the RAM index first,

  • If the pubkey is not there, check the disk-based index,

  • If it's not there, then the account doesn't exist.

Reading Accounts

If we want to read the data from a specific account, and we know the account’s pubkey, we find its location in the accounts file by looking up its pubkey in the account index. We'll pick the account reference from the most recent slot, and then read the data from that location in the accounts file. This data can be directly interpreted by the validator as an Account structure.

Writing Accounts

Another key component of AccountsDB is tracking new account states. For example, when a validator processes a new block of transactions, the result is a batch of new account states at a specific slot. These new states then need to be written to the database. 

The two main steps for writing these accounts are:

  1. Writing the account data into a new accounts file associated with the slot

  2. Updating the index to point to the location of these new accounts (i.e., appending a new (slot, file_id, offset) entry for each account in the file)
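A rough sketch of these two steps, reusing the simplified AccountIndex and AccountRef types from the earlier sketch (the real write path goes through a write cache first, as discussed later):

```rust
/// Sketch: append a batch of account states for `slot` into a new account
/// file and record their locations in the index. The real implementation
/// writes through a cache and only persists rooted slots.
fn write_accounts(
    index: &mut AccountIndex,
    slot: Slot,
    file_id: FileId,
    accounts: &[(Pubkey, Vec<u8>)],
) -> Vec<u8> {
    let mut file_bytes = Vec::new();
    for (pubkey, data) in accounts {
        let offset = file_bytes.len();
        // 1. Serialize the account (a minimal header plus its data) into the new file.
        file_bytes.extend_from_slice(pubkey);
        file_bytes.extend_from_slice(&(data.len() as u64).to_le_bytes());
        file_bytes.extend_from_slice(data);
        // 2. Append a new reference so the index points at the new location.
        index
            .entry(*pubkey)
            .or_default()
            .push(AccountRef { slot, file_id, offset });
    }
    file_bytes
}
```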

Background Threads 

As new account states are written to the database, we must ensure memory is used efficiently by performing 4 key tasks:

  1. Flushing

  2. Cleaning

  3. Shrinking

  4. Purging

  1. Flushing 

To reduce memory requirements, data is periodically flushed from RAM to disk.

For example, the index’s RAM-based hashmap acts as a cache for recently accessed data, and when that data has not been used in a while, it's flushed to disk to make room for more recent account indexes. 

Another example is how new account states are first stored in RAM and only flushed to disk once the associated slot becomes rooted. This approach reduces the number of (slow) disk writes, only writing rooted data.

  2. Cleaning

In order to limit memory growth, we must clean old data. 

For example, if we have two versions of an account, one at slot 10 and another at slot 15, and slot 15 is rooted (i.e., will not be rolled back), then we can remove the account and index data corresponding to the account at slot 10.

Another example of cleaning is when an account drops to zero lamports, in which case we can delete the full account.

Note: The cleaning phase wipes out the index entries, but it does not reclaim the region of storage in the account file that was occupied by the account data. After cleaning, this region is considered garbage data in the account file, which wastes storage space and network bandwidth.

  3. Shrinking

After accounts are cleaned, account files will contain “alive” and “dead” accounts where “dead” accounts have been “cleaned” and therefore are no longer needed. When an account’s file has a small number of alive accounts, the alive accounts can be copied to a smaller file without the dead accounts, saving disk memory. This is called shrinking.

  4. Purging

Lastly, we can also purge entire account files for slots which are forked off of the rooted chain. For example, if one of the branches of a fork becomes rooted, then we can delete all account files associated with the non-rooted branch.

Implementation Details

Having covered the high-level components of how AccountsDB works, we’ll dive into the implementation details next.

Loading From Snapshots

We’ll begin with a more detailed outline describing how a validator loads from a snapshot.

Unpacking the Snapshot

Downloading a snapshot from a peer in the network will usually include: 

1. A full snapshot, and 

2. An incremental snapshot. 

Full snapshots include all the accounts on the network at some specific slot. Incremental snapshots are smaller snapshots and only contain the accounts which changed from a full snapshot. For example, if the network is on slot 100, the full snapshot could contain all accounts at slot 75, and a matching incremental snapshot could contain all accounts that changed between slot 75 and slot 100.

Full snapshots are expensive to create because they contain every account on the network, whereas incremental snapshots are relatively cheap to create since they only contain a subset of the network's accounts. In light of this, validators typically create full snapshots only once in a while and create/update incremental snapshots much more frequently.

Full snapshots adhere to the following naming convention format: 

snapshot-{FULL-SLOT}-{HASH}.tar.zst 

For example, snapshot-10-6ExseAZAVJsAZjhimxHTR7N8p6VGXiDNdsajYh1ipjAD.tar.zst is a full snapshot at slot 10 with a hash of 6ExseAZAVJsAZjhimxHTR7N8p6VGXiDNdsajYh1ipjAD.

And incremental snapshots follow the naming convention format: 

incremental-snapshot-{FULL-SLOT}-{INCREMENTAL-SLOT}-{HASH}.tar.zst.

For example, incremental-snapshot-10-25-GXgKvm3NMAPgGdv2verVaNXmKTHQgfy2TAxLVEfAvdCS.tar.zst is an incremental snapshot which builds on a full snapshot from slot 10 and contains the account changes up until slot 25 and has a hash of GXgKvm3NMAPgGdv2verVaNXmKTHQgfy2TAxLVEfAvdCS.

A matching snapshot and incremental snapshot will have the same {FULL-SLOT} value. Since a validator can have multiple snapshots downloaded at once, on startup it finds the latest full snapshot and its matching incremental snapshot to load and start up from.
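As a small illustration, the two naming conventions above can be parsed with something like the following sketch (the SnapshotName type and parse_snapshot_name function are hypothetical helpers, not part of the actual codebase):

```rust
/// Parsed info from a snapshot archive file name; only handles the two
/// naming conventions described above.
#[derive(Debug, PartialEq)]
enum SnapshotName {
    Full { slot: u64, hash: String },
    Incremental { base_slot: u64, slot: u64, hash: String },
}

fn parse_snapshot_name(name: &str) -> Option<SnapshotName> {
    let stem = name.strip_suffix(".tar.zst")?;
    if let Some(rest) = stem.strip_prefix("incremental-snapshot-") {
        let mut parts = rest.splitn(3, '-');
        let base_slot = parts.next()?.parse().ok()?;
        let slot = parts.next()?.parse().ok()?;
        let hash = parts.next()?.to_string();
        Some(SnapshotName::Incremental { base_slot, slot, hash })
    } else if let Some(rest) = stem.strip_prefix("snapshot-") {
        let mut parts = rest.splitn(2, '-');
        let slot = parts.next()?.parse().ok()?;
        let hash = parts.next()?.to_string();
        Some(SnapshotName::Full { slot, hash })
    } else {
        None
    }
}

fn main() {
    assert_eq!(
        parse_snapshot_name("snapshot-10-6ExseAZAVJsAZjhimxHTR7N8p6VGXiDNdsajYh1ipjAD.tar.zst"),
        Some(SnapshotName::Full {
            slot: 10,
            hash: "6ExseAZAVJsAZjhimxHTR7N8p6VGXiDNdsajYh1ipjAD".to_string()
        })
    );
}
```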

Downloading Snapshots 

To download a snapshot, the validator starts participating in gossip to join the network and identify other nodes. After a while, we look for nodes that:

  • Have a matching shred version (i.e., our network version/hard-forks matches theirs)

  • Have a valid RPC socket (i.e., we can download from them)

  • Have shared a snapshot data type through gossip

The snapshot hash structure is a gossip datatype which contains:

  • The slot and hash of the largest available full snapshot

  • And a list of slots and hashes of available incremental snapshots

Note: the snapshot hash structure can be viewed here.

The validator then begins to download the snapshots, prioritizing snapshots from higher slots. If we have a list of trusted validators (provided through the CLI on startup), we only download snapshots whose hashes match the trusted validators' hashes.

Then for each of these nodes, based on the information in the snapshot hash structure, we construct the file name of the snapshot:

  • full: snapshot-{slot}-{hash}.tar.zst

  • incremental: incremental-snapshot-{base_slot}-{slot}-{hash}.tar.zst

Using the node’s IP address, RPC port, and the filepath, we start downloading the snapshots. We periodically check the download speed and make sure it's fast enough, or we try to download from another node instead.

Once the validator finds available snapshots, it will first uncompress and unarchive them, leading to a directory layout as described below.

The main code path which unpacks a snapshot and builds the corresponding bank and AccountsDB starts by running the function called bank_from_snapshot_archives.

Note: The bank is a data structure representing a single block/slot which stores information including the parent bank, its slot, its hash, the block height, and more.

Note: For a specification of snapshots, see Richard Patel’s Informal Guide to Solana Snapshots. 

Loading Account Files

The first step in unpacking a snapshot is to decompress it and load the account files, which happens in the verify_and_unarchive_snapshots function.

In this function, multiple threads are spawned to decompress and unpack the snapshot from its archive. After unpacking the snapshot, each of the account files are sanitized to ensure they contain valid accounts, are mmap’d into memory, and are tracked in the database.

Note: While a validator is building a snapshot for a slot and packaging these account files, due to the implementation, the validator will continue to write account data to the account files for more recent slots. Because this data does not correspond to the snapshot’s slot, this data should be ignored. This means that when reading accounts from the file, we need to read the length from the snapshot metadata file (more specifically the AccountsDB metadata) to know the last index of useful data to read.

Note: In older snapshot versions, there could be more than one account file per slot, but in the new versions, there will only be one file per slot. 

Note: Account files are also referred to as storages and stores in the Rust implementation.

Loading Snapshot Metadata

After loading the account files, the validator loads the AccountsDB and bank metadata from the snapshot metadata file using the Bincode format in rebuild_bank_from_unarchived_snapshots.

Note: Bincode is an encoding scheme (https://github.com/bincode-org/bincode)

The bank metadata includes info such as the epoch schedule, stake information, the block height, and more. The AccountsDB metadata includes info such as the length of the data used in each of the account files, a cumulative hash of all the accounts to identify data corruption, and more.

Note: The snapshot metadata file is located by the path /snapshots/{slot}/{slot} in the unarchived snapshot where {slot} is the slot for the snapshot.

Note: For a list of the full fields, see src/core/snapshot_fields.zig in the Sig repository or see BankFieldsToDeserialize and AccountsDbFields in the Rust implementation.

Generating the Account Index

Using the AccountsDB metadata, the validator builds the AccountsDB structure using the reconstruct_accountsdb_from_fields function. It then loads the account files into the database using AccountsDB::initialize and generates the index using AccountsDB::generate_index.

The architecture of the index divides it into many buckets. Each bucket is a mini index that covers only a small portion of the accounts. Each pubkey is associated with a specific bucket based on the pubkey's first N bits. Each bucket contains both RAM storage (a simple HashMap, referred to as an InMemAccountsIndex in the codebase) and disk storage (referred to as a Bucket in the codebase, backed by an mmap'd file).

Note: the RAM map uses a Vec<(Slot, FileId, Offset)> as its value, which is also referred to as a slot list in the Rust codebase.

Note: Bucketing based on the first N bits is also similar to their approach to crds-shards (as discussed in Part 1 of this series, covering the Gossip Protocol).

Note: The default number of buckets used in the Rust code is 8192.
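For illustration, the first-N-bits bucket selection could look like the sketch below; with the default of 8192 buckets, N = 13 (2^13 = 8192). The exact bit extraction in the real implementation may differ:

```rust
/// Pick an index bucket for a pubkey from its first N bits.
/// N is derived from the bucket count: 8192 buckets -> N = 13 bits.
fn bucket_index(pubkey: &[u8; 32], num_buckets: usize) -> usize {
    assert!(num_buckets.is_power_of_two() && num_buckets <= 1 << 16);
    let bits = num_buckets.trailing_zeros(); // 13 for 8192 buckets
    // Interpret the first two bytes as a big-endian u16 and keep the top `bits` bits.
    let prefix = u16::from_be_bytes([pubkey[0], pubkey[1]]) as usize;
    prefix >> (16 - bits)
}

fn main() {
    let pubkey = [0xAB; 32];
    assert!(bucket_index(&pubkey, 8192) < 8192);
}
```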

The full architecture is described in the figure below:

Note: To ensure all indexes are stored in RAM when running the validator, you can use `--disable-accounts-disk-index`

These components are stored in the AccountsDB struct in the following layout: 

Background Threads: Flushing Data 

Since the RAM-based index stores the most recently accessed index data, background threads (referred to as BgThreads in the codebase) are used to flush old index data to disk to make room for more recent index data. This flushing logic is handled by the InMemAccountsIndex::flush_internal function. RAM-based index entries are flushed to disk/bucket memory once they "age out"; the age parameters control how often entries are flushed and take into account whether the entry's slot is rooted, among other factors.

Disk Based Indexes 

RAM indexes are a hashmap backed by RAM, while disk indexes are a hashmap backed by disk memory. Both are hashmap implementations, but the disk index was implemented from scratch in the Rust repository and is more complex to explain.

The main structure of the disk index is the Bucket, which contains a BucketStorage structure holding an mmap'd file. This mmap'd file is the backing memory for the hashmap.

To support put, get, and delete methods, BucketStorage requires an implementation of a trait called BucketOccupied to identify which memory is free and which memory is occupied.

This means the BucketStorage stores:

  1. An mmap’d file (mmap variable in the figure below), and

  2. A struct implementing BucketOccupied to track which indexes are free/occupied (contents variable in the figure below)

There are two implementations of the BucketOccupied trait in the codebase: 

  • The IndexBucket 

  • And the DataBucket 

Each Bucket instance contains a single IndexBucket and a vector of DataBuckets, as shown below: 

When there is only one slot version of the account, the index data is stored in the IndexBucket. When there are multiple slot versions of the account, the index data is stored in a DataBucket.

The Index Bucket

The IndexBucket stores a list of `IndexEntries` in the mmap’d file to represent the index data and uses a BitVec to store an enum for each index to identify if it's free or occupied. 

The possible enum values for each index include:

  • Free: the index is not occupied

  • ZeroSlots: the index is occupied but hasn’t been written to yet

  • SingleElement: there is only a single slot stored for the pubkey and the index data is stored directly

  • MultipleSlots: there are multiple slots stored for the pubkey (and the index info is stored in one of the DataBuckets)

The organization of the IndexBucket is explained in the diagram below:

Notice how the mmap’d file contains a list of IndexEntries and the BitVec identifies how to read the contents:

  • In the diagram's second entry, the BitVec's value is 2, which corresponds to the single-slot case (SingleElement), so the single_element field in the IndexEntry is read to directly access the account reference value.

  • In the diagram's last entry, the BitVec's value is 3, which is MultipleSlots, so the multiple_slots field in the IndexEntry is read and used to look up the account references in the data buckets (discussed more in the next section).

Note: Since there are four possible enum values, only 2 bits are needed to represent each one, so a single u64 can represent 64 / 2 = 32 different indexes. This is what the BitVec structure does.
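A minimal sketch of this 2-bit packing, in the spirit of that BitVec (not the actual implementation):

```rust
/// Sketch: pack 2-bit occupancy values into u64 words (32 entries per u64),
/// in the spirit of the BitVec used by the disk index.
struct TwoBitVec {
    words: Vec<u64>,
}

impl TwoBitVec {
    fn new(num_entries: usize) -> Self {
        TwoBitVec { words: vec![0; (num_entries + 31) / 32] }
    }

    /// Store one of the four enum values (0..=3) for `index`.
    fn set(&mut self, index: usize, value: u8) {
        assert!(value < 4);
        let (word, shift) = (index / 32, (index % 32) * 2);
        self.words[word] &= !(0b11u64 << shift); // clear the 2-bit slot
        self.words[word] |= (value as u64) << shift; // write the new value
    }

    fn get(&self, index: usize) -> u8 {
        let (word, shift) = (index / 32, (index % 32) * 2);
        ((self.words[word] >> shift) & 0b11) as u8
    }
}
```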

Note: The full implementation is similar to a flat hashmap backed by disk memory. The pubkey is hashed to get an index to start the search from, followed by linear probing to find the matching key. However, because it's a relatively simple implementation, it's still slow compared to the RAM hashmap.

The Data Bucket

When reading the IndexBucket, if the BitVec's value for an entry is MultipleSlots, the pubkey's index data is stored in one of the DataBuckets. The MultipleSlots struct stores:

  • The number of slots stored in the index (the num_slots field), which is used to compute which DataBucket to read from

  • The offset into that DataBucket to read from (the offset field)
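A sketch of this struct, with simplified stand-in types (the real MultipleSlots in bucket_map stores additional details):

```rust
/// Sketch of the entry used when a pubkey has multiple slot versions.
struct MultipleSlots {
    /// Number of (slot, file_id, offset) entries stored for this pubkey;
    /// this also determines which DataBucket holds them (see below).
    num_slots: u64,
    /// Offset into that DataBucket where the entries begin.
    offset: u64,
}
```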

Since the Bucket struct contains N DataBuckets, the DataBucket in which a given pubkey's index data is stored is determined by the number of slots the pubkey contains, in power-of-two increments. For example:

  • index data which contains 0-1 slots are stored in the first DataBucket

  • index data which contains 2-3 slots are stored in the second DataBucket

  • index data which contains 4-7 slots are stored in the third DataBucket

  • … 

The logic for finding a pubkey's index data and reading the corresponding value lives in Bucket::read_value in bucket_map/src/bucket.rs.
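A simplified sketch of that lookup is shown below, reusing the MultipleSlots sketch above and a stand-in AccountRef type; the real code reads these entries out of mmap'd memory rather than Vecs:

```rust
#[derive(Clone, Copy)]
struct AccountRef {
    slot: u64,
    file_id: u32,
    offset: usize,
}

/// What an occupied IndexBucket entry can resolve to (see the enum above).
enum IndexEntryValue {
    SingleElement(AccountRef),
    MultipleSlots(MultipleSlots),
}

/// Which DataBucket holds a pubkey's entries, following the power-of-two
/// ranges above: 0-1 slots -> bucket 0, 2-3 -> bucket 1, 4-7 -> bucket 2, ...
fn data_bucket_index(num_slots: u64) -> usize {
    if num_slots <= 1 {
        0
    } else {
        (63 - num_slots.leading_zeros()) as usize
    }
}

/// Sketch of Bucket::read_value: return the slot list for an entry, either
/// directly from the IndexBucket or from the appropriate DataBucket.
fn read_value(entry: &IndexEntryValue, data_buckets: &[Vec<AccountRef>]) -> Vec<AccountRef> {
    match entry {
        IndexEntryValue::SingleElement(account_ref) => vec![*account_ref],
        IndexEntryValue::MultipleSlots(multi) => {
            let bucket = &data_buckets[data_bucket_index(multi.num_slots)];
            let start = multi.offset as usize;
            bucket[start..start + multi.num_slots as usize].to_vec()
        }
    }
}
```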

Generating the Index 

When a validator starts up, the index is generated using the function generate_index_for_slot, which is called on each account file (in parallel) to extract all the accounts from the files and place each account reference in its corresponding index bin.

This is also where secondary indexes are generated.

Secondary Indexes 

Consider wanting to get all the accounts in the database that have a specific owner. To do this, you would need to iterate over every pubkey in the index, access each account in its account file, and do an equality check on the account's owner field. This is what the infamous RPC call getProgramAccounts does in the code.

With a lot of accounts, this linear search can be both very expensive and very slow.

Since this search is very common, secondary indexes store the list of pubkeys associated with a specific owner. This allows you to load the full account data only for pubkeys you already know have that owner, which makes the process much more efficient.
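As a sketch, an owner-based secondary index can be as simple as a map from owner to the set of owned pubkeys (the real implementation maintains several secondary indexes and keeps them in sync with account updates):

```rust
use std::collections::{HashMap, HashSet};

type Pubkey = [u8; 32];

/// A minimal owner-based secondary index: owner pubkey -> owned account pubkeys.
#[derive(Default)]
struct OwnerIndex {
    by_owner: HashMap<Pubkey, HashSet<Pubkey>>,
}

impl OwnerIndex {
    /// Keep the index up to date whenever an account's state is written.
    fn insert(&mut self, owner: Pubkey, account: Pubkey) {
        self.by_owner.entry(owner).or_default().insert(account);
    }

    /// The fast path behind a getProgramAccounts-style query: return the
    /// candidate pubkeys without scanning the whole account index.
    fn accounts_owned_by<'a>(&'a self, owner: &Pubkey) -> impl Iterator<Item = &'a Pubkey> + 'a {
        self.by_owner.get(owner).into_iter().flatten()
    }
}
```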

Features, Built-in Programs, and Sysvars

After all the accounts have been indexed, the bank is created from the metadata and stores a reference to the full AccountsDB. Initialization is finished with the Bank::finish_init function, which activates features and loads built-in programs and sysvars.

Features are activated if the snapshot slot is greater than the feature activation slot. This logic is handled in Bank::apply_feature_activations.

Some example features include: 

  • blake3_syscall_enabled

  • disable_fees_sysvar

  • etc.

The full list of features can be found under the FEATURE_NAMES variable in the codebase or on the GitHub wiki page here.  

Built-in programs are also added to the bank using Bank::add_builtin; the full list of built-in programs can be found under the BUILTINS variable. This includes programs such as:

  • The system program

  • The vote program

  • The stake program

Lastly, the sysvars are loaded in the function Bank::fill_missing_sysvar_cache_entries, which loads variables such as the clock, epoch schedule, and rent.

Reading and Writing Accounts

After all the accounts have been indexed, the next important component to understand is how accounts are read and written.

Reading Accounts

When processing a new block, the validator first loads in the associated accounts based on the pubkeys defined in the transaction. When searching for a pubkey’s account it does the following: 

  • Computes the corresponding bin the pubkey should belong to by considering the first N bits

  • It then searches the RAM memory of that bin

  • If it’s not in RAM, then it will try to look it up on disk

    • If it finds the index data on disk, it will then store the data in RAM for faster lookup next time

In the Rust code, the code path for reading accounts starts with AccountsDB::do_load_with_populate_read_cache, which calls AccountsDB::read_index_for_accessor_or_load_slow to find the index data corresponding to the pubkey.

This function first finds the corresponding index bin and then calls InMemAccountsIndex::get_internal, which searches RAM and disk for the index data.

A code snippet of the main flow is below:
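(A simplified sketch, not the actual Agave code; it reuses the bucket_index and AccountRef types from the earlier sketches.)

```rust
use std::collections::HashMap;

/// Sketch of the lookup flow: check the bin's RAM map first, fall back to
/// its disk bucket, and cache the result in RAM on a disk hit.
fn get_account_refs(
    ram_bins: &mut [HashMap<Pubkey, Vec<AccountRef>>],
    disk_bins: &[HashMap<Pubkey, Vec<AccountRef>>], // stand-in for the disk Buckets
    pubkey: &Pubkey,
    num_bins: usize,
) -> Option<Vec<AccountRef>> {
    // Pick the bin from the pubkey's first N bits (see the earlier sketch).
    let bin = bucket_index(pubkey, num_bins);

    // 1. RAM index (InMemAccountsIndex).
    if let Some(refs) = ram_bins[bin].get(pubkey) {
        return Some(refs.clone());
    }

    // 2. Disk index (Bucket). On a hit, populate the RAM index so the next
    //    lookup for this pubkey is fast.
    if let Some(refs) = disk_bins[bin].get(pubkey) {
        ram_bins[bin].insert(*pubkey, refs.clone());
        return Some(refs.clone());
    }

    // 3. Not found in RAM or on disk: the account does not exist.
    None
}
```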

After finding the index data for the pubkey, since the data is a vector of account references across different slots, the largest slot is used to identify the most up-to-date version of the account.

The index data can then point to three possible locations: 

  • In an accounts file on disk (which can be looked up using the file_id and offset) 

  • The read cache

  • The write cache

We’ve already discussed how to read from an account file, but the two locations we haven’t talked about yet are the read and write cache. The read cache is another cache layer which is updated with the full account data (different from the index cache) after reading the account from an account file. The write cache contains full account data in slots which have not been rooted yet (this is discussed more in the next section).

In the code, the account is read using retry_to_get_account_accessor and LoadedAccountAccessor::check_and_get_loaded_account, which can load the account from any of the three possible locations.

Writing Accounts

After processing a block of transactions, we now have a set of new accounts corresponding to a specific slot that we want to store and index. To do this, we first store the accounts into a write cache called AccountsDB.accounts_cache and update the index to point to the cache location. 

Periodically, account data in this cache, which corresponds to a rooted slot, is flushed to disk using a background thread (discussed in the next section).

The code flow starts with `store_cached`, which collects all the accounts written in a block using collect_accounts_to_store; they are then stored in the cache and indexed using the function store_cached_inline_update_index.

AccountsDB Background Service

Since we’ve discussed how reads/writes work in AccountsDB, next we’ll discuss how AccountsDB ensures memory is used efficiently.

This includes 4 main background tasks: 

  • Flushing: writing data from RAM to disk

  • Cleaning: finding old accounts or zero-lamport accounts and freeing memory

  • Shrinking: shrinking account files with a large number of dead accounts

  • Purging: removing forked account files

Flushing, Cleaning, and Shrinking are done in the AccountsBackgroundService structure, which:

  1. Gets the current root bank

  2. Periodically flushes the rooted slots 

  3. Cleans the accounts which were flushed

  4. And lastly, runs the shrink process. 

The main code is explained below: 
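(A heavily simplified sketch; the trait methods below are stand-ins for the real AccountsDB calls.)

```rust
/// Stand-in interface for the AccountsDB operations described above.
trait AccountsDbOps {
    fn latest_root_slot(&self) -> u64;
    fn flush_accounts_cache(&self, root: u64); // flush rooted write-cache slots to disk
    fn clean_accounts(&self, max_root: u64);   // drop old / zero-lamport account versions
    fn shrink_candidate_slots(&self);          // rewrite files with few alive accounts
}

fn accounts_background_service(db: &impl AccountsDbOps, exit: &std::sync::atomic::AtomicBool) {
    use std::sync::atomic::Ordering;
    let mut last_cleaned_root = 0;

    while !exit.load(Ordering::Relaxed) {
        // 1. Get the current root bank / root slot.
        let root = db.latest_root_slot();

        // 2. Periodically flush rooted slots from the write cache to disk.
        db.flush_accounts_cache(root);

        // 3. Clean the accounts which were flushed (up to the new root).
        if root > last_cleaned_root {
            db.clean_accounts(root);
            last_cleaned_root = root;
        }

        // 4. Run the shrink process on any queued candidates.
        db.shrink_candidate_slots();

        std::thread::sleep(std::time::Duration::from_millis(400));
    }
}
```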

The full flow is explained in the diagram below:

Flushing

The first background task is Flushing, which reads from the write cache and pushes the accounts associated with rooted slots to disk via a new account file. Each rooted slot will have an associated account file that contains the accounts which changed in that slot.

The full code path for 1) collecting rooted slots and 2) flushing these slots to disk starts from bank.flush_accounts_cache().

bank.flush_rooted_accounts_cache() collects unflushed rooted slots by reading from bank.maybe_unflushed_roots, iterates over the roots in reverse order and flushes each one to disk using flush_slot_cache_with_clean.

The main function that stores the accounts to disk is AccountsDB::store_accounts_custom with its store_to parameter set to Disk. It first computes the total number of bytes required to store all the accounts, allocates a new account file large enough to hold them, and then writes the accounts into it.

After the accounts are written to disk, the index is updated to point to the new account locations with AccountsDB::update_index.

Note: Notice how the outer function, flush_slot_cache_with_clean, is suffixed with with_clean. This hints that the function also performs cleaning (covered more fully below) while flushing, which it does.

Note: The unflushed roots are iterated in reverse order because we only flush the first occurrence of a pubkey. For example, if the unflushed_roots = [12, 19] and a pubkey exists in both roots, we’ll only flush the pubkey account in slot 19. To implement this, the code uses a HashSet of pubkeys in a closure called should_flush_f that is populated as the slots are flushed. 

Note: This cleaning only considers the roots which are also in the cache so it is not technically a full clean.

Flushing rooted account states to disk can sometimes leave older versions of accounts dead. While iterating over the accounts to update the index, the 'reclaims' array is populated with the old account references that are now considered dead. The logic for marking account files 'dead' and queuing files for shrinking is in AccountsDB::handle_reclaims.

Depending on how the cleaning went, some account files can end up with zero or only a small number of alive accounts. If there are zero alive accounts, the account file is considered dead, and all the corresponding account data is removed. If there is a small number of alive accounts ('small' is defined by the number of bytes of alive accounts), the account file is added to the shrink candidates to be queued for shrinking.

Below is pseudocode to explain the full process: 
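(The following Rust-flavored sketch mirrors only the control flow described above; the types and helpers are stand-ins.)

```rust
use std::collections::{HashMap, HashSet};

type Slot = u64;
type Pubkey = [u8; 32];

struct Db {
    /// Write cache: slot -> (pubkey -> account bytes).
    write_cache: HashMap<Slot, HashMap<Pubkey, Vec<u8>>>,
    unflushed_roots: Vec<Slot>,
}

impl Db {
    fn flush_rooted_accounts_cache(&mut self) {
        // Iterate the unflushed roots in reverse order so only the newest
        // version of each pubkey is flushed (should_flush_f in the real code).
        let mut flushed_pubkeys: HashSet<Pubkey> = HashSet::new();
        let mut roots = std::mem::take(&mut self.unflushed_roots);
        roots.sort_unstable();

        for slot in roots.into_iter().rev() {
            let Some(cached) = self.write_cache.remove(&slot) else { continue };

            // Keep only pubkeys we haven't already flushed from a newer root.
            let to_flush: Vec<(Pubkey, Vec<u8>)> = cached
                .into_iter()
                .filter(|(pubkey, _)| flushed_pubkeys.insert(*pubkey))
                .collect();

            // Write the surviving accounts into a new account file for this
            // slot and point the index at the new locations; old index
            // entries that get replaced are collected as "reclaims" and can
            // mark their account files dead or as shrink candidates.
            self.store_accounts_to_disk(slot, &to_flush);
        }
    }

    fn store_accounts_to_disk(&mut self, _slot: Slot, _accounts: &[(Pubkey, Vec<u8>)]) {
        // Stand-in for store_accounts_custom + update_index + handle_reclaims.
    }
}
```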

After this, the new slots are added to `uncleaned_roots` which is later used to remove zero-lamport accounts.

Cleaning

Cleaning finds accounts with zero lamports or accounts that have old history that can be removed, deletes these accounts from the index, and then queues small enough account files to be shrunk.

The code path starts with AccountsDB::clean_accounts, which first collects all the pubkeys that need to be cleaned by reading the uncleaned_pubkeys variable and gathering (pubkey, slot) tuples from other rooted slots in the AccountsDB::construct_candidate_clean_keys function.

More info on uncleaned_pubkeys: once a new block is fully processed, the bank is frozen using bank.freeze(), where, among other things, it computes a bank_hash which uniquely identifies the block. This hash is a combination of other hash values, including a hash of all the accounts modified in the block, called the account_delta_hash. To compute this value, the calculate_accounts_delta_hash function is called, which also adds all of the accounts' pubkeys to the uncleaned_pubkeys variable. Since the cleaning flow is called only after flushing, cleaning will process all the accounts which were recently flushed.

For all of these pubkeys, the zero lamport accounts and the old accounts are handled separately. 

To handle the old accounts, clean_accounts_older_than_root is called which purges slots which are less than the rooted slot (except for the most recent rooted slot), removing the value from the index’s slot list. If after this process the slot list is empty, then the entry is removed from the index.

During this process, the index values which are removed are tracked as reclaimed (the same as in the flushing process); the removed accounts are later counted as dead in their respective account files, which leads to each file either being marked dead or becoming a candidate for shrinking.

To handle the zero-lamport accounts, the account file which holds each account has its alive_account_count reduced by one to mark the account as dead, and the corresponding index data is deleted.

Then, similar to the end of the flushing flow, for each of the account files, if the file only contains a small number of alive accounts, it is added to shrink_candidate_slots to be queued for shrinking.

Shrinking

Shrinking finds account files with a small number of alive accounts, removes the dead accounts, and reallocates the file to reduce the amount of disk memory used.

The code path for shrinking starts in the shrink_candidate_slots function which reads all the slots stored in the shrink_candidate_slots variable. For each of these slots, the do_shrink_slot_store function is called, which finds the alive accounts in the slot’s accounts file. 

To decide if an account is alive it looks up the (pubkey, slot) in the index. If the tuple doesn't exist in the index, then the account is considered dead, and stored in the unrefed_pubkeys variable. Otherwise, if the account is alive (and the tuple is found in the index) then it's added to the alive_accounts variable. 

After collecting all the alive accounts, it writes the accounts to disk in a new account file which is properly sized and deletes the old accounts file.
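A sketch of shrinking a single slot's account file, with simplified stand-in types (the real code also updates the index and storage bookkeeping):

```rust
use std::collections::HashSet;

type Slot = u64;
type Pubkey = [u8; 32];

struct StoredAccount {
    pubkey: Pubkey,
    data: Vec<u8>,
}

/// Sketch: keep only the accounts still referenced by the index and rewrite
/// them into a new, right-sized file (returned here instead of written).
fn shrink_slot(
    slot: Slot,
    old_file_accounts: Vec<StoredAccount>,
    index: &HashSet<(Pubkey, Slot)>, // does the index still reference (pubkey, slot)?
) -> Vec<StoredAccount> {
    let (alive_accounts, dead_accounts): (Vec<_>, Vec<_>) = old_file_accounts
        .into_iter()
        .partition(|account| index.contains(&(account.pubkey, slot)));

    // In the real code the dead accounts' pubkeys are recorded in
    // unrefed_pubkeys; here they are simply dropped.
    drop(dead_accounts);

    // The alive accounts would be written into a new account file and the
    // old, oversized file deleted.
    alive_accounts
}
```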

Purging

And lastly, when a new slot is rooted, banks which are part of a non-rooted fork are purged: their accounts are removed from the write cache and their index data is deleted, via the Drop trait implemented on the bank struct.