Merkle Trees: How Blockchain Data is Stored - A Simple Explanation

The Need for Efficient Data Storage

With every block added to a blockchain, its size grows overall. Storing each block as raw data would quickly become burdensome, and comparing two copies of one blockchain would require going through that large set of data. Merkle Trees exist to make blockchain storage and verification easier, without compromising on security. Their basic building blocks are hash functions.

Hash Functions

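Hash functions are one-way algorithms that create a “digital fingerprint” for any data. The same input always produces the same output, and the output has a fixed length no matter how large the input is. Beyond that, changing even one character of the input produces a completely different output, so the result looks random even though it is fully deterministic. Finally, you can’t determine an input purely by looking at an output. For example, your Ethereum wallet address is derived from a hash: it is the last 20 bytes of the Keccak-256 hash of your public key, written as 0x followed by 40 hexadecimal characters.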
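To make these properties concrete, here is a minimal sketch using SHA-256 from Python's standard library. SHA-256 is chosen only for illustration (Bitcoin uses it, while Ethereum uses Keccak-256), and the `fingerprint` helper and the sample messages are made up for this example.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()


a = fingerprint(b"send 1 BTC to Alice")
b = fingerprint(b"send 1 BTC to Alice")
c = fingerprint(b"send 2 BTC to Alice")

print(a == b)   # True: the same input always gives the same output
print(a == c)   # False: changing one character changes the whole digest
print(len(a))   # 64: the output length is fixed, whatever the input size
```

Nothing in the digest hints at the original message, which is what "one-way" means in practice.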

Hash functions are commonly used in traditional applications. For example, well-built applications don’t store your password; they store a hash of it instead. When you log into an application, your input is hashed and then compared to the hash stored in the application’s database.
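The login flow can be sketched as follows. This is an illustration under assumptions, not how any particular application does it: the function names are hypothetical, and the random salt and the slow PBKDF2 hash are additions that real systems typically use on top of the plain "store a hash" idea described above.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 100_000  # assumed work factor, chosen for illustration


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest). Only these two values are stored, never the password."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for each user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def check_password(attempt: str, salt: bytes, stored: bytes) -> bool:
    """Hash the login attempt with the stored salt and compare the digests."""
    _, digest = hash_password(attempt, salt)
    return hmac.compare_digest(digest, stored)  # constant-time comparison


salt, stored = hash_password("correct horse battery staple")
print(check_password("correct horse battery staple", salt, stored))  # True
print(check_password("wrong guess", salt, stored))                   # False
```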

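Hash functions make verifying data extremely efficient. They can be stacked on top of each other to create a data structure called a Merkle Tree. Verifying a Merkle Tree is far more efficient than verifying the entire dataset: two copies of the data match if their top-level hashes match, and checking that one item belongs to the tree takes only a handful of hashes rather than a pass over everything. On Bitcoin, for example, a Merkle Tree is created for every block, covering all the transactions inside.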

Merkle Tree

A Merkle Tree is built from the bottom up. Each transaction in a block is hashed. Each hash is then paired with a neighbor, and the pair is hashed again. This repeats level by level until one hash is left, called the “Root Hash.” This structure of hashes itself is a Merkle Tree.
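The construction can be sketched in a few lines of Python. This is a simplified illustration, not Bitcoin's exact code: it uses a single round of SHA-256 (Bitcoin hashes twice), the function names are hypothetical, and the transactions are placeholder byte strings. When a level has an odd number of hashes, the last one is paired with itself, which is the convention Bitcoin follows.

```python
import hashlib
from typing import List, Tuple


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves: List[bytes]) -> List[List[bytes]]:
    """Return every level of the tree, from the leaf hashes up to the root."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]  # odd count: pair the last hash with itself
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def merkle_root(leaves: List[bytes]) -> bytes:
    return build_levels(leaves)[-1][0]


def merkle_proof(leaves: List[bytes], index: int) -> List[Tuple[bytes, bool]]:
    """Sibling hashes on the path from one leaf to the root.

    The flag is True when the sibling sits to the right of the current node.
    """
    proof = []
    for level in build_levels(leaves)[:-1]:
        if len(level) % 2 == 1:
            level = level + [level[-1]]
        sibling = index ^ 1  # the neighbor in the same pair
        proof.append((level[sibling], sibling > index))
        index //= 2
    return proof


def verify(leaf: bytes, proof: List[Tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the root from one leaf plus its sibling hashes."""
    node = h(leaf)
    for sibling, sibling_is_right in proof:
        node = h(node + sibling) if sibling_is_right else h(sibling + node)
    return node == root


txs = [b"tx-a", b"tx-b", b"tx-c", b"tx-d", b"tx-e"]
root = merkle_root(txs)
proof = merkle_proof(txs, 2)
print(len(proof))                      # 3 sibling hashes cover 5 transactions
print(verify(b"tx-c", proof, root))    # True
print(verify(b"tx-zzz", proof, root))  # False
```

The proof grows with the height of the tree, not with the number of transactions, which is where the efficiency comes from.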

When blockchains are stored on miners’ devices, this is the format they’re in. Each block has a data component (the transactions, which the Merkle Tree summarizes) as well as a block header. The block header contains general information about the block, like its number or timestamp, the Root Hash of the block’s own Merkle Tree, and the hash of the previous block’s header.

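So on Bitcoin, a device doesn’t need a copy of every transaction to check that one was included. Full nodes do keep every transaction, but lightweight clients keep only the block headers and request a short Merkle proof when they need to check a specific transaction. When a new block is created, its header contains the hash of the previous block’s header. This creates a chain between blocks (i.e. a block… chain).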
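Here is a minimal sketch of that linking. It assumes each header stores the hash of the previous header alongside its own Merkle root, which is how Bitcoin headers are linked. The `Header` class and its fields are simplified stand-ins for illustration (real headers carry more fields), and the Merkle roots are placeholder hashes rather than roots of real trees.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Header:
    number: int
    prev_hash: bytes    # hash of the previous block's header
    merkle_root: bytes  # root of this block's transaction tree

    def hash(self) -> bytes:
        payload = self.number.to_bytes(8, "big") + self.prev_hash + self.merkle_root
        return hashlib.sha256(payload).digest()


def chain_is_valid(headers: List[Header]) -> bool:
    """Every header must point at the hash of the header before it."""
    return all(cur.prev_hash == prev.hash() for prev, cur in zip(headers, headers[1:]))


def placeholder_root(label: bytes) -> bytes:
    # Stand-in for a real Merkle root, for illustration only.
    return hashlib.sha256(label).digest()


genesis = Header(0, b"\x00" * 32, placeholder_root(b"block 0 txs"))
block1 = Header(1, genesis.hash(), placeholder_root(b"block 1 txs"))
block2 = Header(2, block1.hash(), placeholder_root(b"block 2 txs"))
print(chain_is_valid([genesis, block1, block2]))  # True

# Changing a transaction in block 1 changes its Merkle root, hence its header
# hash, so block 2 no longer points at it.
forged = Header(1, genesis.hash(), placeholder_root(b"forged txs"))
print(chain_is_valid([genesis, forged, block2]))  # False
```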

Data Storage on Ethereum

In Ethereum’s case, data storage is more elaborate. Ethereum doesn’t just process transactions, but also complex behaviors with smart contracts. To ensure everything is accurately tracked, Ethereum’s blocks contain 3 Merkle Trees, compared to Bitcoin’s single one.

The first Merkle Tree is based on all the transactions in a block, similar to Bitcoin. This is called the Transaction Tree. Next, the Receipts Tree records the outcome of each transaction in the block, such as whether it succeeded, how much gas it used, and which event logs it emitted. Finally, the State Tree contains every account on Ethereum, including its balance at the end of that block and, for smart contracts, references to their code and stored data.

Each of these 3 Trees uses a structure Ethereum calls a Merkle Patricia Trie (also written Patricia Merkle Tree). It works like a Merkle Tree, except each value in the Tree is stored under a “Key.” The Key spells out the path to follow down the Tree, so a specific transaction or account can be isolated very easily.
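The idea of Keys tracing a path can be sketched with a toy hashed trie. This is a deliberately simplified illustration, not Ethereum's real structure: Ethereum uses Keccak-256, RLP encoding, and compressed node types, none of which appear here. The class names, keys, and values are all hypothetical.

```python
import hashlib
from typing import Dict, Optional


class Node:
    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}  # one child per hex character
        self.value: Optional[bytes] = None

    def hash(self) -> bytes:
        """Hash this node's value together with the hashes of its children."""
        hasher = hashlib.sha256()
        hasher.update(b"value:" + (self.value or b""))
        for nibble in sorted(self.children):
            hasher.update(nibble.encode() + self.children[nibble].hash())
        return hasher.digest()


class HashedTrie:
    def __init__(self) -> None:
        self.root = Node()

    def put(self, key: bytes, value: bytes) -> None:
        node = self.root
        for nibble in key.hex():  # each hex character is one step down
            node = node.children.setdefault(nibble, Node())
        node.value = value

    def get(self, key: bytes) -> Optional[bytes]:
        node = self.root
        for nibble in key.hex():
            node = node.children.get(nibble)
            if node is None:
                return None
        return node.value

    def root_hash(self) -> bytes:
        return self.root.hash()


trie = HashedTrie()
trie.put(b"\x01\x23", b"alice: 10")
trie.put(b"\x01\x45", b"bob: 5")
print(trie.get(b"\x01\x23"))  # b'alice: 10'

before = trie.root_hash()
trie.put(b"\x01\x45", b"bob: 6")
print(trie.root_hash() != before)  # True: any changed value changes the root
```

Looking up a key only walks one path down the trie, and the root hash still commits to every stored value.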

Data Availability

When validators add blocks to the Ethereum blockchain, they must broadcast all the transaction data for that block to the other validators on the network. This is called making the data available, or Data Availability.

When validators receive a block of data, they execute all the transactions inside and compute the results. Every single validator does this independently, and they should all arrive at the same Root Hash for that block, which is how they check each other’s work. Then, validators move on to preparing the next block to be added to the blockchain.
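A toy version of this re-execution check is sketched below. It makes strong simplifying assumptions: the state is just a dictionary of balances, transactions are plain transfers, and the "root" is a flat hash of the sorted state rather than a real Merkle root. The function names and account names are hypothetical.

```python
import hashlib
import json
from typing import Dict, List, Tuple

Transfer = Tuple[str, str, int]  # (sender, recipient, amount)


def execute(state: Dict[str, int], txs: List[Transfer]) -> Dict[str, int]:
    """Apply each transfer in order and return the new state."""
    new_state = dict(state)  # leave the caller's state untouched
    for sender, recipient, amount in txs:
        if new_state.get(sender, 0) < amount:
            continue  # skip transfers the sender cannot afford
        new_state[sender] = new_state.get(sender, 0) - amount
        new_state[recipient] = new_state.get(recipient, 0) + amount
    return new_state


def state_root(state: Dict[str, int]) -> str:
    """Stand-in for a real state root: hash of the canonically sorted state."""
    encoded = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


start = {"alice": 10, "bob": 0}
block = [("alice", "bob", 4), ("bob", "carol", 1)]

# Three validators execute the same block independently.
roots = {state_root(execute(start, block)) for _ in range(3)}
print(len(roots))  # 1: every validator arrives at the same root

# A validator that received altered data ends up with a different root.
altered = [("alice", "bob", 5), ("bob", "carol", 1)]
print(state_root(execute(start, altered)) in roots)  # False
```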

The fact that every validator must execute every transaction indicates something very important: a blockchain can only handle as many transactions as its validators can execute. For the blockchain to process more transactions, in other words, each validator would be forced to execute more transactions.

This is the Data Availability Problem, and it is at the heart of the Blockchain Trilemma. The Trilemma says that decentralization, security, and scalability are extremely difficult to achieve at the same time. Increasing block size to increase scalability, for example, means that validators must all process more transactions, which requires better hardware, which in turn shrinks the pool of people who can afford to run a validator and so hurts decentralization.

Thank you for reading! I hope you enjoyed this summary of Merkle Trees, Data Availability, and blockchain data storage.

Stay kind. Stay curious.