Data availability and blockchain: all you need to know - Part 1
What is the data availability problem? | DA and Ethereum rollups | DAS, sequencers, and DA for ZK rollups: all the facets of the DA problem
Data availability, or DA, is a hot topic in Web3, but it can be tricky for regular users to understand. We’ll delve deep into the definition of DA and the challenges that it poses, before exploring the key solutions like Celestia and EigenDA in Part 2.
Data availability definition on Ethereum - wide sense and narrow sense
As rollups generate ever more transaction data, they need a safe way to store it so that they can continue scaling without creating congestion. This is the job of data availability layers: services that allow rollups to store large bodies of transaction data cheaply and make it permanently accessible to all interested parties, from full nodes on the L1 to end users.
However, before we get to the data availability (DA) problem in the context of rollups, we have to understand what data availability even means. Different sources propose very different definitions for what sort of data should be made available - and to whom.
In this guide to DA, we will move from the narrowest context to the widest, like ripples spreading on water. We will do a deep dive into DA on Ethereum, data availability sampling (DAS), and key DA solutions like EigenDA, Celestia, and Avail DA. In Part 2, we’ll focus on DA on Solana and on Pontem’s own upcoming DA solution for Lumio on Solana.
Data availability and full nodes
A full node in a network like Bitcoin or Ethereum is one that stores a complete and up-to-date copy of the whole blockchain state and executes all the transactions in a new block before approving it.
This mechanism gives rise to one of the biggest advantages of blockchains: trustlessness. A full node doesn’t need to trust the block producer or a central authority that a certain transaction is legit. Instead, when it validates a new block, it knows that all the transactions in it are valid, because it has personally processed and verified each of them.
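To make this concrete, here is a toy sketch of that verification loop in Python. Everything in it (the Block shape, apply_transaction, state_root) is a simplified stand-in for what a real client like Geth does, not actual Ethereum client code:

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass
class Block:
    transactions: list        # raw transactions broadcast by the producer
    claimed_state_root: str   # the state root the producer says results

def apply_transaction(state: dict, tx: dict) -> dict:
    # Stand-in for full EVM execution: just move `value` between accounts.
    state = dict(state)
    state[tx["from"]] -= tx["value"]
    state[tx["to"]] = state.get(tx["to"], 0) + tx["value"]
    return state

def state_root(state: dict) -> str:
    # Stand-in for Ethereum's Merkle-Patricia state root.
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()

def verify_block(block: Block, state: dict) -> bool:
    # Trustlessness in one loop: re-execute everything locally,
    # then compare the computed root with the one the producer claimed.
    for tx in block.transactions:
        state = apply_transaction(state, tx)
    return state_root(state) == block.claimed_state_root

# Usage: an honest block passes; a block with a wrong root would fail.
state = {"alice": 10, "bob": 0}
txs = [{"from": "alice", "to": "bob", "value": 3}]
honest = Block(txs, state_root(apply_transaction(state, txs[0])))
assert verify_block(honest, state)
```

The key point: the node never takes the producer's word for anything; it only needs the block data itself.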
Block producers (validators) need to broadcast block data to all full nodes so that they can download it and re-execute the transactions, essentially recreating the block. Ethereum mainnet doesn’t have any data availability problem in this narrow sense - but in the process, scaling issues arise:
a) as the full blockchain size keeps getting bigger, you need more resources to run a full node (1.45 terabytes on Ethereum in January 2024). This limits the potential number of full nodes.
b) as every node has to process every transaction, the network can’t go faster than a single node: adding more nodes doesn’t add throughput, while the workload per node keeps growing with usage.
Ethereum blockchain size in GB. Credit: Ycharts
Data availability and Ethereum light nodes
This is the second circle from the center in our analogy: we add light nodes to the mix. Such light clients don’t download the full blockchain state - only the header of each new block. Therefore, they can’t look inside and verify all the transactions. Instead, light Ethereum clients rely on a sync committee: 512 validators, randomly selected roughly once a day, who attest that the data in each block header is correct.
This isn’t full data availability, but it’s good enough. And if a light node wants information on a specific transaction, it can always send a request to a full node.
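That request-and-verify flow relies on Merkle proofs: the full node returns the transaction together with a branch of sibling hashes, and the light node checks it against the root committed to in the header it already holds. Here is a hedged sketch using a simple binary Merkle tree (Ethereum’s actual structure is a Merkle-Patricia trie, so the details differ):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_inclusion(tx: bytes, branch: list, root: bytes) -> bool:
    # branch is a list of (sibling_hash, side) pairs from leaf to root.
    node = h(tx)
    for sibling, side in branch:
        node = h(sibling + node) if side == "left" else h(node + sibling)
    return node == root

# Usage: build a tiny 4-leaf tree and prove that leaf 2 is in it.
leaves = [h(bytes([i])) for i in range(4)]
l01, l23 = h(leaves[0] + leaves[1]), h(leaves[2] + leaves[3])
root = h(l01 + l23)                       # this is what the header commits to
branch = [(leaves[3], "right"), (l01, "left")]
assert verify_inclusion(bytes([2]), branch, root)
```

The light node only needs the header plus a logarithmic-size proof - it never has to download the full block.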
Data availability vs. retrievability: new vs. historical data
The official Ethereum documentation makes an interesting distinction between data availability and data retrievability. Data availability is about nodes having access to the data they need to verify the current block, while retrievability refers to historical data. Data can be stored on the mainnet for a limited time and then deleted, and you’ll still have data availability, even if the older data is no longer retrievable.
Note that major DA solutions like Celestia store both current and historical data, so they take care of data retrievability, too.
Data availability and Ethereum rollups
To understand the importance of rollups, consider these daily transaction numbers for September 11, 2024:
Base: 4.7 million transactions;
Arbitrum: 1.8 million transactions;
Optimism: 0.65 million transactions;
Ethereum mainnet: 1.2 million transactions.
Together, the three largest optimistic rollups processed over 7.1 million transactions in a single day, which is almost 6 times more than on Ethereum itself.
Credit: BaseScan.org
Rollups help solve Ethereum’s scalability issue by executing transactions that would otherwise have to be executed by full Ethereum nodes. Relative to Ethereum mainnet, those transactions are processed off-chain - to be precise, by a powerful node called a sequencer. The sequencer then posts block data in compressed form to the L1.
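As a rough illustration of the compression step (real sequencers use custom binary encodings rather than JSON plus zlib, so treat the numbers as indicative only):

```python
import json
import zlib

# A toy batch of 1,000 L2 transactions.
batch = [{"from": f"0xuser{i}", "to": "0xdex", "value": i} for i in range(1000)]

raw = json.dumps(batch).encode()
compressed = zlib.compress(raw, 9)   # maximum compression level

print(f"{len(batch)} txs: {len(raw)} bytes raw -> {len(compressed)} bytes compressed")
# The compressed payload is what gets posted to Ethereum as calldata or a blob.
```

Many L2 transactions thus shrink into one small L1 payload - which is exactly where the data availability question begins.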
It was with the arrival of rollups like Arbitrum and Optimism that data availability issues on Ethereum became obvious. A huge batch of hundreds of rollup transactions can be condensed into just one transaction sent to the mainnet. So how do you know that the data posted by a rollup is correct?
Let’s see how the definition of the data availability problem changes when we add rollups into the mix.
Challenge no.1: making sure full L1 nodes have access to rollup block data
Full nodes on the mainnet need a way to know that the block data published by the sequencer is valid. At the same time, that data has to be kept compressed, otherwise it would be too expensive to post it on the mainnet - and in any case, Ethereum nodes can’t possibly go through all those millions of rollup transactions and check them one by one.
Here’s another way to phrase the rollup data availability problem: rollups must make their block data verifiable by full L1 nodes without having to publish all of it on the L1.
In optimistic rollups like Arbitrum, Optimism, and Lumio, this is achieved in two ways:
Up until the Dencun upgrade, rollups published transaction information on Ethereum in the form of calldata and allowed anyone to challenge it within a 7-day window. Once the window closed, the data could no longer be challenged, but the calldata stayed on Ethereum forever, so it was permanently available - and very expensive to post: calldata used to account for around 95% of rollup costs.
The Dencun upgrade itself (specifically EIP-4844) introduced a new data object type called a blob, which is a much cheaper way to store rollup data.
Blob data is stored on the L1 for only about 18 days, after which it is deleted. From the point of view of L1 data availability, that’s not a problem: Ethereum nodes only need to verify new incoming blocks, and 18 days is more than enough for that. But if rollups want their data to remain retrievable beyond that window, they need to use solutions like DA layers.
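A back-of-envelope comparison shows why blobs are so much cheaper. The gas constants below come from EIP-2028 (calldata) and EIP-4844 (blobs); the gas prices are made-up example values, since real prices float with demand:

```python
DATA_BYTES = 100_000  # ~100 KB of compressed rollup data

# Calldata: 16 gas per non-zero byte (assume the worst case: all non-zero).
calldata_gas = DATA_BYTES * 16
gas_price_gwei = 20                # example execution-gas price, not a real quote
calldata_cost_eth = calldata_gas * gas_price_gwei * 1e-9

# Blob: one blob holds 131,072 bytes and consumes 131,072 blob gas,
# priced on a separate - and usually much cheaper - blob-gas fee market.
blob_gas = 131_072
blob_gas_price_gwei = 0.5          # example blob-gas price, not a real quote
blob_cost_eth = blob_gas * blob_gas_price_gwei * 1e-9

print(f"calldata: {calldata_gas:,} gas ~ {calldata_cost_eth:.4f} ETH")
print(f"blob:     {blob_gas:,} blob gas ~ {blob_cost_eth:.6f} ETH")
```

With these illustrative prices, the same 100 KB costs roughly 0.032 ETH as calldata versus well under 0.0001 ETH as a blob - a difference of several hundred times.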
ZK (zero-knowledge) rollups solve the data availability problem differently. The ZK proofs that they compute and post to the L1 are in themselves proof enough that their block data is valid, so full Ethereum nodes don’t need to see what’s inside.
Challenge no.2: making sure sequencers make ALL data available
As Yuan Han Li points out in his excellent article “WTF is data availability”, it’s not enough to know that the data made available by a rollup sequencer is valid. What if the sequencer withholds part of the information?
This leads us to reformulate our rollup data availability problem: ensuring that rollup sequencers make all their data available to L1 nodes for verification, without having to publish all of it to the L1.
This can be achieved through data availability sampling (DAS), also known as data availability proofs (DAP). A node can verify the validity of sequencer data by downloading small samples of it - and once it has taken several samples from the same block, it can be pretty sure that the sequencer has been honest.
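The math behind that confidence is simple. In a typical DAS design, block data is erasure-coded so that a dishonest sequencer has to withhold at least half of the extended data to hide anything at all; each uniformly random sample then has at least a 50% chance of exposing the fraud. A quick sketch (the 50% threshold is the simplified one-dimensional case; Celestia’s 2D scheme differs in detail):

```python
# Probability that a sequencer withholding data survives k random samples.
# With >= 50% of extended chunks missing, each sample misses the hole
# with probability at most 1/2, so k independent samples all miss
# with probability at most (1/2)^k.
for k in (5, 10, 20, 30):
    p_fooled = 0.5 ** k
    print(f"{k:2d} samples -> chance of being fooled: {p_fooled:.10f}")
# 30 samples already push the odds below one in a billion.
```

This is why even a lightweight node, sampling a few dozen random chunks, can be statistically certain the full block data is out there.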
Who does the sampling, though? In the model introduced by Celestia, it’s light nodes that sample sequencer data, while full nodes use the samples to reconstruct the block. On Ethereum, DAS is still at the proposal stage but should be introduced in a future upgrade.
Challenge no.3: giving all interested parties access to new and historical rollup data
In the widest possible sense, data availability means accessibility and verifiability of L1 and L2 transaction data for all interested parties:
dApps;
Rollups;
Light nodes;
End users.
This applies both to data in newly proposed blocks and to data already included in the blockchain (historical data).
ZK rollups are a very good example of why data availability in this wide sense is important. Their zero-knowledge proofs may be good enough for full Ethereum nodes, but they are insufficient for everyone else, because you can’t reconstruct rollup transactions (or state transitions) from the proofs alone. Without the underlying data, users can’t even know their account balances for sure.
Why have we spent so much time on the various meanings of the data availability problem? So that you can see that different sources interpret the issue differently, and that the very definition of DA is complicated. When we talk about DA layers in the next article, we will mean data availability in the widest possible sense, because DA layers like Celestia - and the way our own Lumio on Solana deals with DA - cater to all groups of participants, not just full nodes, and cover both availability and retrievability.
Hopefully you can now see that data availability is an extremely important issue for the rollup space and for Ethereum scaling in general. In Part 2, we’ll see how projects like EigenDA, Celestia, and Avail DA are solving the problem of data availability on Ethereum. Stay tuned!