Battle-Testing Ethereum’s Finality

This blog was posted on our Medium page.

On May 11th and 12th, Ethereum briefly stopped finalizing transactions. More than 60% of validators went offline, so transactions were still being included in blocks but those blocks were not being finalized. The offline validators had to come back online quickly to avoid being caught in an inactivity leak.

Fortunately, Ethereum is designed to handle these situations and eventually recover without intervention. If you’re not familiar with what we’re referring to, don’t worry. In this article we’ll discuss the intricacies of Ethereum’s finality, the implications of the inactivity leak, and the liveness versus safety debate for Ethereum.

Understanding Transaction Finality

The inactivity leak on the Ethereum mainnet was a first-time occurrence. To fully understand it, we need to grasp the concept of transaction finality on Ethereum. In Ethereum’s Proof-of-Stake (PoS) model, transaction finality is achieved through a process similar to a democratic election. Validators review each block and cast votes (attestations) on it, and a block achieves finality when it gathers votes from validators holding at least two-thirds of the total staked ETH. Once transactions are finalized, they are permanently recorded on the blockchain and cannot be changed.
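The two-thirds rule can be sketched in a few lines of Python. This is a minimal illustration, not client code: the function name is hypothetical, votes are assumed to be weighted by staked ETH, and exactly two-thirds is treated as sufficient.

```python
def is_supermajority(attesting_stake: int, total_stake: int) -> bool:
    """Return True when the attesting stake is at least two-thirds of the total.

    Integer arithmetic (3 * a >= 2 * t) avoids floating-point rounding
    right at the two-thirds boundary.
    """
    if total_stake <= 0:
        raise ValueError("total_stake must be positive")
    return 3 * attesting_stake >= 2 * total_stake


# Stake is expressed as a percentage of the total for readability.
print(is_supermajority(40, 100))  # False: 40% participation cannot finalize
print(is_supermajority(67, 100))  # True: a supermajority is attesting
```

With participation at 40%, as during the incident, the check fails no matter how the remaining votes are distributed.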

It’s important to note that anyone who wants to validate transactions on Ethereum must stake 32 ETH. If a validator acts dishonestly, its stake can be slashed, and a validator whose balance drops to 16 ETH is ejected from the validator pool. If you want to learn more, you can refer to the short thread on finality that we posted a few days before the incident.

But let’s take a step back and understand the term finality and what it means. In settlement systems outside of Ethereum, finality refers to the moment when an obligation or asset transfer becomes unconditional and irrevocable. This ensures that all involved parties have fulfilled their obligations, even in cases of insolvency or bankruptcy, as outlined in international standards and domestic laws governing financial market infrastructures.

In other words, once an entity or person makes a payment to a counterparty, the transaction should not be reversible. This is particularly crucial for large-value payment systems that settle significant amounts of money daily. Transactions must be irreversible; otherwise, businesses wouldn’t be able to move forward.

Before The Merge, Ethereum ran on a Proof-of-Work consensus mechanism similar to Bitcoin, providing probabilistic finality. There was an economic guarantee that as more blocks were added to the canonical chain, it would become increasingly expensive for attackers to reorganize the blocks. However, the move to the PoS model using the Casper protocol for finality aimed to provide stronger guarantees than PoW.

Non-finalization Event Breakdown

Finality was lost twice, on May 11th and 12th, for 3 and 8 epochs respectively. On May 11th (Thursday), during epoch 200551, network participation dropped to 40%. Missed slots rose to 18 out of 32 in an epoch, far above the typical 1 or 2, and blocks stopped finalizing. Each block carries attestations, so fewer blocks meant fewer attestations available for finalization.
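To put those epoch counts into wall-clock terms, here is a small sketch. It assumes 32 slots per epoch and the 12-second block time mentioned later in this article; the helper names are ours and purely illustrative.

```python
SECONDS_PER_SLOT = 12   # usual block time
SLOTS_PER_EPOCH = 32    # slots in one epoch


def epochs_to_minutes(epochs: int) -> float:
    """Convert a number of epochs into minutes of wall-clock time."""
    return epochs * SLOTS_PER_EPOCH * SECONDS_PER_SLOT / 60


def missed_share(missed_slots: int) -> float:
    """Fraction of an epoch's slots that produced no block."""
    return missed_slots / SLOTS_PER_EPOCH


print(epochs_to_minutes(3))  # 19.2
print(epochs_to_minutes(8))  # 51.2
print(missed_share(18))      # 0.5625
```

So 18 missed slots means more than half of the epoch's blocks never appeared, against a typical 3% to 6%.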

Figure: Time (seconds) between each block

On May 12th (Friday), the non-finalization event lasted from 17:20:23 to 18:24:23 UTC. Transactions took over a minute longer than usual to be included, compared with the average block time of 12 seconds. Interestingly, despite the reduced block space, gas fees did not exceed the highest daily average. Users transacting on Ethereum during the incident likely wouldn’t have noticed anything significantly wrong.

Figure: Blocks per hour

The fact that more than 60% of blocks were missing indicated that more than one consensus client was experiencing issues. The cofounders of Prysmatic Labs noted that the incident was due to unexpected behavior that the Prysm and Teku clients didn’t handle well. These clients were receiving old attestations, causing Prysm to replay a large number of states to verify the chain’s validity. The clients were overwhelmed by these computations, leaving them with limited capacity to produce blocks and attestations.

Additionally, the increased number of validators post-Shapella and maximum deposits from the activation queue caused increased hashing and latency, leading to a rapid decline in network performance. It’s worth mentioning that there are currently over 590,000 validators on the mainnet (at time of writing), compared to the testnet with only 400,000 validators during client team testing.

Since the Shapella upgrade allowed withdrawals, there have been many more deposits to the mainnet. Initially, partial and full withdrawals outweighed deposits, but within a few weeks, the trend reversed due to increased validator rewards and lower staking risk. If someone wants to join Ethereum as a validator now, they would have to wait in a queue for more than 30 days, as only nine validators are currently allowed to join the network per epoch, totaling 2,025 per day.
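The queue arithmetic can be checked with a short sketch. It assumes 12-second slots and 32 slots per epoch; the queue length of 60,750 is a hypothetical figure chosen only to show what a 30-day wait implies, not a measured value.

```python
SECONDS_PER_EPOCH = 32 * 12                        # 32 slots of 12 seconds
EPOCHS_PER_DAY = 24 * 60 * 60 // SECONDS_PER_EPOCH  # 225 epochs in a day


def activations_per_day(per_epoch: int) -> int:
    """Validators that can be activated in one day at a given per-epoch limit."""
    return per_epoch * EPOCHS_PER_DAY


def wait_days(queue_length: int, per_epoch: int) -> float:
    """Days needed to clear a queue of the given length."""
    return queue_length / activations_per_day(per_epoch)


print(activations_per_day(9))  # 2025
print(wait_days(60_750, 9))    # 30.0 (hypothetical queue length)
```

In other words, a wait of more than 30 days corresponds to more than roughly 60,000 validators already waiting in line.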

Source: https://wenmerge.com

The Inactivity Leak Takes Effect

So, what actually happened during this incident? In the first non-finalization event, the network was able to recover without any penalties. However, the second occurrence lasted for a full hour, triggering the inactivity leak. The leak is designed for situations such as a network partition, where validators cannot communicate with each other: validators that are not attesting start losing balance, and the penalty grows the longer finality is absent (roughly quadratically over time, rather than at a fixed rate).

This leaves the offline validators with a diminished stake, while the online validators gradually hold a larger share of the total stake until they make up the two-thirds of attesting stake needed for finalization. An offline validator would lose 16 ETH after approximately 3 weeks and be ejected from the validator pool.
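A rough simulation shows how the leak plays out for a validator that never comes back. This is a simplified model under stated assumptions, not the exact protocol logic: in leak epoch t the validator loses balance * t / 2**24 (the quotient is assumed to match the current mainnet parameter), the penalty is applied to the actual balance rather than the effective balance, and all function names are ours.

```python
SECONDS_PER_EPOCH = 32 * 12  # 32 slots of 12 seconds each


def balance_after_leak(epochs: int, start_eth: float = 32.0,
                       penalty_quotient: int = 2**24) -> float:
    """Balance of an always-offline validator after `epochs` of leak.

    Assumed model: the penalty in leak epoch t is balance * t / penalty_quotient,
    so each epoch costs a little more than the one before.
    """
    balance = start_eth
    for t in range(1, epochs + 1):
        balance -= balance * t / penalty_quotient
    return balance


def epochs_until(target_eth: float, start_eth: float = 32.0,
                 penalty_quotient: int = 2**24) -> int:
    """Number of leak epochs before the balance falls to `target_eth`."""
    balance, t = start_eth, 0
    while balance > target_eth:
        t += 1
        balance -= balance * t / penalty_quotient
    return t


short_loss = 32.0 - balance_after_leak(8)                 # an 8-epoch leak
days_to_16 = epochs_until(16.0) * SECONDS_PER_EPOCH / 86_400
print(f"loss after 8 epochs: {short_loss:.6f} ETH")
print(f"days until 16 ETH:   {days_to_16:.1f}")
```

Under these assumptions, a leak lasting a handful of epochs costs a tiny fraction of an ETH, while the time to fall to 16 ETH comes out on the order of three weeks.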

The inactivity leak resulted in the burning of 28 ETH, which amounts to less than 0.0006 ETH per offline validator or 0.002% of a validator’s 32 ETH deposit. During this time, approximately 50 ETH in revenue was lost due to missing attestations. Regular Ethereum transactions, such as DEX swaps, NFT minting, and yield farming, continued to be executed on the mainnet, meaning end users likely didn’t notice anything out of the ordinary. This showcases Ethereum’s liveness.

Even in highly challenging situations, such as a large-scale crisis or emergency, as long as validators worldwide have internet access, they can help keep Ethereum running. This is all part of the liveness vs. safety debate, which we’ll discuss further.

Reasons Ethereum Kept Running

It’s important to note that while Prysm and Teku clients went down during both non-finalization incidents, Lighthouse clients remained operational, processing transactions on Ethereum as usual. The affected client teams quickly released hot fixes inspired by how Lighthouse handled the situation. Each consensus client is an independent implementation, so a bug in one does not necessarily affect the others. This diversity of clients played a crucial role in keeping Ethereum running, even when more than 60% of validators were offline.

Source: https://clientdiversity.org/#distribution

Compare this to two years ago, when Prysm nodes made up approximately 65% of the network. If Prysm nodes had gone offline along with other nodes, we would have faced a more significant problem, rendering Ethereum almost unusable. Today, Ethereum is not a monolithic system but a composition of diverse and distributed components. Thanks to the open-source ethos, anyone can create their own consensus node to validate Ethereum, further increasing client diversity and resilience.

Another factor that contributed to keeping Ethereum running during the non-finalization events was the LMD-GHOST protocol. LMD-GHOST ensures liveness by providing a fork-choice rule that helps maintain the continuity of the blockchain. Validators are able to choose which blocks to support and build upon, even if some validators are inactive or not participating.

This means that as long as there are active validators, the blockchain can continue to grow, and new blocks can be added to it. LMD-GHOST achieves this by using the weights of subtrees created by forks as a heuristic and assuming that the subtree with the heaviest weight is the “correct” one. This ensures that validators will always end up at a leaf block, which defines a canonical chain.
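The heaviest-subtree rule described above can be sketched as follows. This is a toy model under stated assumptions: blocks are plain string labels, each validator contributes only its latest vote, ties are broken by label (an arbitrary choice for this example), and the function and variable names are ours rather than any client's API.

```python
from collections import defaultdict
from typing import Optional


def lmd_ghost_head(parent: dict[str, Optional[str]],
                   latest_votes: dict[str, str],
                   stake: dict[str, int],
                   root: str) -> str:
    """Walk from `root` to a leaf, always taking the heaviest child subtree."""
    # Build the child lists from the parent pointers.
    children = defaultdict(list)
    for block, par in parent.items():
        if par is not None:
            children[par].append(block)

    # A vote for a block also counts for every ancestor of that block,
    # so each block's weight is the stake voting within its subtree.
    weight = defaultdict(int)
    for validator, block in latest_votes.items():
        node: Optional[str] = block
        while node is not None:
            weight[node] += stake[validator]
            node = parent[node]

    head = root
    while children[head]:
        head = max(children[head], key=lambda b: (weight[b], b))
    return head


# Toy fork: G -> A -> C and G -> B -> {D, E}
parent = {"G": None, "A": "G", "B": "G", "C": "A", "D": "B", "E": "B"}
stake = {v: 32 for v in ("v1", "v2", "v3", "v4", "v5")}
votes = {"v1": "C", "v2": "C", "v3": "D", "v4": "E", "v5": "E"}

print(lmd_ghost_head(parent, votes, stake, "G"))  # E

# v4 and v5 go offline: only the remaining votes are counted.
online = {v: b for v, b in votes.items() if v not in ("v4", "v5")}
print(lmd_ghost_head(parent, online, stake, "G"))  # C
```

The second call illustrates the liveness property: with two of five validators silent, the rule still returns a single head to build on, even though that head cannot be finalized without a supermajority.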

Liveness vs Safety Tradeoffs

The challenge of achieving both liveness and safety in open-source distributed ledger networks like Ethereum stems from the CAP theorem, which states that it is impossible for a distributed system to guarantee consistency, availability, and partition tolerance simultaneously. In the event that validator nodes are unable to communicate with each other, they must prioritize either consistency or availability.

For a general-purpose blockchain like Ethereum, it is important to prioritize liveness. If Ethereum were to prioritize safety over liveness, in the event of a network partition, the network would halt and no transactions would be able to go through. LMD-GHOST provides some measure of ‘safety’ in this scenario while the network is unable to achieve full safety or finality. However, applications built on top of Ethereum as a base layer can prioritize safety as needed. For example, during the non-finalization event, dYdX paused deposits to ensure the safety of its users’ transactions.

But what would happen if the inactivity leak persisted indefinitely? Applications that rely on Ethereum’s finality would be affected, such as optimistic rollups with a 7-day fraud proof window, as their finality is tied to Ethereum’s finality. An optimistic outcome would be for the inactivity leak to eventually restore Ethereum’s finality, with any offline validators that come back online resuming attestations to the latest block they perceive as part of the canonical chain.

In the case of forks on Ethereum, the community can apply common sense to determine which fork represents the result of the originally agreed-upon transactions, although this should be considered only as a nuclear option.

Closing Thoughts

The recent non-finalization events on Ethereum serve as a stark reminder of the complex challenges inherent in maintaining an open blockchain network. Ethereum is not a finished product but an ever-evolving network undergoing constant heavy research and development.

These incidents have demonstrated that Ethereum’s inactivity leak mechanism functions as intended, and no catastrophic events occurred that could have had a significant impact on the network’s operation. Weaknesses in client implementations were also identified and addressed. It currently takes an average of 2.5 epochs for transactions to finalize, but future developments such as single-slot finality, which allows for finalization within the same slot a proposal is made, will be interesting to watch.
