The Infinite State Problem


#1

The problem of infinitely expanding state has been discussed as a side effect of other issues and as something that requires short-term fixes, but it hasn’t been addressed head-on in a way that prompts discussion of sustainable, long-term changes to the incentive structures for blockchain users.

This post is not deeply technical, but I hope it will prompt the necessary technical discussion to support and challenge it.

Basically, I have two conclusions:

  1. We need to model the actual relationship between state growth and node capacity so we can clearly determine when we’re above or below the line of sustainability.

  2. We should look into actively incentivizing behavior that releases state so it can be pruned away, e.g. by refunding gas fees when state is released, which would mitigate this effect.

Background

I recommend reading through these resources to get up to speed:

  1. Stack Overflow: How Big will the Ethereum state grow?

  2. Stack Overflow: What are the Ethereum disk space needs?

  3. A response to the above, The Ethereum Blockchain size has exceeded 1TB and yes it’s an issue (warning: long read). A bit peripheral, but it covers many of the counter-arguments, like why light nodes aren’t an answer and why this matters in the first place.

  4. Stack Overflow: What is state-trie pruning and how does it work?

Quick Math

Claim: Scalability brings significantly bigger state size problems than the ones already showing up in Ethereum.

Let’s say Ethereum’s ~10 tps results in a potential storage growth rate of roughly 50 GB per year. This seems like a low-end estimate, but let’s run with it.

Now enter the larger-but-not-scalable blockchains like EOS, which can process something like 1k–10k tps, so we’d imagine their state to grow 100x–1000x faster: 5–50 TB per year at capacity.

Now enter the scalable blockchains that claim upwards of 100k tps (10,000x Ethereum’s throughput), where full capacity would mean state growth of 500 TB per year.

Yikes.
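To make the extrapolation explicit, here’s a back-of-envelope sketch of the arithmetic. The linear scaling from an assumed 10 tps / 50 GB-per-year Ethereum baseline is the load-bearing assumption; all constants are illustrative, not measured:

```python
# Naive linear extrapolation: state growth scales with sustained throughput.
# Baseline figures are assumptions from the discussion above, not measurements.
BASELINE_TPS = 10                    # assumed Ethereum throughput
BASELINE_GROWTH_GB_PER_YEAR = 50     # assumed resulting state growth

def yearly_state_growth_gb(tps: float) -> float:
    """Extrapolate yearly state growth linearly from the baseline."""
    return BASELINE_GROWTH_GB_PER_YEAR * (tps / BASELINE_TPS)

for tps in (10, 1_000, 10_000, 100_000):
    print(f"{tps:>7,} tps -> ~{yearly_state_growth_gb(tps):,.0f} GB/year")
# 10 tps -> ~50 GB/year ... 100,000 tps -> ~500,000 GB/year (500 TB)
```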

Nodes already struggle to sync properly with the Ethereum network, so there’s no way a node could keep up with the sizes these other networks are proposing.

As important as the rate of increase is the simple fact that state grows without bound, and the fundamental design of the system encourages this.

Counter Arguments and Solutions

Moore’s Law

A common counter is that Moore’s Law will keep disk space expanding fast enough to keep up, thus mitigating the runaway effects of infinite state expansion. This seems nakedly wrong for two reasons.

First, the numbers above show disk-space requirements exploding well beyond what any doubling effect could keep up with.

Second, disk space isn’t the only constraint that grows with state: network bandwidth and database operations also have to keep up with validation needs.

Sharding

Another mitigation is layer 1 scaling in the form of state sharding (like we’re proposing). This helps by splitting the state across multiple shards so nodes only carry the bits they need to validate transactions on their own shard. But as overall state grows to infinity, the average state per node (let’s pretend it’s split evenly across nodes) still increases significantly unless network node capacity grows fast enough to keep up.

It’s basically the same problem – you can think of each shard as operating like its own Ethereum network, so each shard sees the same (call it 50 GB) of state growth every year at maximum capacity. Meanwhile, the cross-shard communication overhead keeps increasing as the network topology expands into more shards, so reality is obviously not as simple as thinking about a single shard in isolation.

Either way, as time goes to infinity, we again find ourselves battling the ability of nodes to store, process and transmit the necessary state against the overall network activity.

In the “good” case, improvements in technology outpace the growth of the network usage and node capability remains sufficient to keep up. More nodes can join the network and it decentralizes over time.

In the “bad” case, network activity and state size exceed node capabilities, nodes fall out of sync, and either transaction costs spike high enough to destroy the economics of businesses built on top of the chain, or the runaway effect continues until the last node drops out and the chain dies.

The problem is that I haven’t yet seen an analysis that puts these factors next to each other on a chart so it’s obvious which case we’re in.
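As a starting point, here’s a minimal sketch of that side-by-side comparison. Every constant is a hypothetical placeholder (a chain sustaining ~10k tps, a node starting with 4 TB of disk, disk doubling every two years); the point is the shape of the curves, not the exact values:

```python
# Cumulative state vs. affordable node disk over time, both hypothetical.
YEARS = 10
STATE_GROWTH_TB_PER_YEAR = 50     # e.g. a chain sustaining ~10k tps
NODE_DISK_TB = 4                  # starting disk of a commodity node
DISK_DOUBLING_YEARS = 2           # optimistic Moore's-Law-style doubling

state_tb = 0.0
for year in range(1, YEARS + 1):
    state_tb += STATE_GROWTH_TB_PER_YEAR
    disk_tb = NODE_DISK_TB * 2 ** (year / DISK_DOUBLING_YEARS)
    status = "ok" if disk_tb >= state_tb else "nodes fall behind"
    print(f"year {year:2}: state ~{state_tb:5.0f} TB, disk ~{disk_tb:5.0f} TB -> {status}")
```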

Pruning

Pruning (see this Stack Overflow post) basically allows for the garbage collection of old, now-obsolete branches of the state trie whenever new transitions edit or delete data. In a system with deterministic finality, like PoS with BFT consensus (unlike probabilistic PoW), that’s viable.
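To make the mechanism concrete, here’s a toy mark-and-sweep illustration. It assumes deterministic finality, so superseded roots can simply be dropped from the live set; the node store and keys are invented for the example:

```python
# Toy mark-and-sweep pruning over a hash-linked node store.

def reachable(store: dict, root: str) -> set:
    """Collect every node id reachable from a root."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(store[node]["children"])
    return seen

def prune(store: dict, live_roots: list) -> None:
    """Delete every node not reachable from a live (finalized) root."""
    keep = set()
    for root in live_roots:
        keep |= reachable(store, root)
    for node in list(store):
        if node not in keep:
            del store[node]

# Two state versions share most nodes; once root_v1 is no longer live,
# the branch only it references can be collected.
store = {
    "root_v1": {"children": ["shared", "old_leaf"]},
    "root_v2": {"children": ["shared", "new_leaf"]},
    "shared": {"children": []},
    "old_leaf": {"children": []},
    "new_leaf": {"children": []},
}
prune(store, live_roots=["root_v2"])
assert "old_leaf" not in store and "root_v1" not in store and "shared" in store
```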

Pruning is generally presented as a passive efficiency gained as a byproduct of everyday transactions. I’m particularly interested in how we can incentivize active pruning by rewarding applications for releasing their state.

It seems simply unsustainable that a one-time fee lets you lock up state for all of eternity, creating unbounded future costs for the network. At the very least, we should incentivize people to release state so it can be pruned away, e.g. by rebating some of their gas fees. The economics here are particularly interesting – presumably you’d be more incentivized to remove old state when gas prices are high, which seems like an appropriate mechanism but might further increase congestion at exactly that moment – and the concept bears further research.
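Here’s a minimal sketch of what such bottle-deposit-style mechanics could look like. The deposit rate, refund fraction, and per-key locking scheme are all assumptions for illustration, not a worked-out proposal:

```python
# Pay a deposit proportional to bytes stored; refund most of it on release.
DEPOSIT_PER_BYTE = 200    # gas units locked per byte written (made up)
REFUND_FRACTION = 0.8     # portion of the deposit returned on release (made up)

class StateLedger:
    def __init__(self):
        self.deposits = {}  # storage key -> (owner, locked gas)

    def store(self, owner: str, key: str, num_bytes: int) -> int:
        """Lock a deposit against a storage key; returns gas to debit."""
        locked = num_bytes * DEPOSIT_PER_BYTE
        self.deposits[key] = (owner, locked)
        return locked

    def release(self, owner: str, key: str) -> int:
        """Delete the entry so it can be pruned; returns gas to refund."""
        stored_owner, locked = self.deposits.pop(key)
        assert stored_owner == owner, "only the depositor can release"
        return int(locked * REFUND_FRACTION)

ledger = StateLedger()
cost = ledger.store("alice", "contract/slot0", num_bytes=1_000)
refund = ledger.release("alice", "contract/slot0")
print(f"locked {cost} gas up front, refunded {refund} gas on release")
```

A refund fraction below 1 leaves the network compensated for the time the state was held, which is the bottle-deposit intuition from the recommendations below.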

Recommendations and Open Questions

  • We need to nail down the math on this. What are the second-order effects of increasing state size, like network usage, database query times, and compute requirements? If we can model these, we can at least gut-check assumptions about the ability of Moore’s Law, Nielsen’s Law, or whatever law to keep up.

  • We need to think about how to incentivize freeing up space to create a strong countervailing force against this runaway state. I’d love to explore refunding a portion of gas fees for freeing state, even if that means increasing fees up front, sort of like a bottle deposit at the grocery store.


#2

Hi Erik – I’ve been thinking about some related ideas for reducing storage requirements. Let me know what you think! I’m assuming a UTXO model.

  1. This may be obvious, but as a first step, we can discard some transaction data, like signatures, after each node has had plenty of time to validate it. We can then just store the set of UTXOs, which is fairly small.

  2. Rather than actually storing UTXO data (public key, amount, optional script, etc.), nodes could store a 32-byte hash of each UTXO. The UTXO owner would be expected to store the original data and provide it later, when they spend the UTXO.

  3. Taking this a step further, we could have nodes forget all the UTXOs in a block after a while and store just the Merkle root of the UTXOs in that block. The owner of a UTXO would be expected to store a Merkle proof that it was included in a certain block and provide that proof when spending the UTXO (see the sketch after this list).

  4. Taking this idea to its extreme, we could store the Merkle root of each block in a higher-level Merkle tree. This tree would be append-only (ordered by insertion), which lets us add values without storing the entire tree; we only need the Merkle path of the most recent leaf. When spending a UTXO, a user would present the root of the block containing the UTXO, then prove that that block root is contained in the higher-level tree (in addition to the Merkle proof from #3).
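Here’s a minimal sketch of the per-block scheme in #3, assuming SHA-256 and a plain binary Merkle tree (the helpers are invented for illustration): nodes keep only the root, while the UTXO owner keeps the leaf data plus the sibling path and presents both when spending.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root_and_proof(leaves: list[bytes], index: int):
    """Return (root, proof), where proof is the list of sibling hashes
    from the target leaf up to the root."""
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])          # duplicate last node on odd levels
        proof.append(level[index ^ 1])       # sibling at this level
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return level[0], proof

def verify(leaf: bytes, index: int, proof: list[bytes], root: bytes) -> bool:
    """Recompute the path from leaf to root using the sibling hashes."""
    node = h(leaf)
    for sibling in proof:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root

utxos = [b"utxo-0", b"utxo-1", b"utxo-2", b"utxo-3"]
root, proof = merkle_root_and_proof(utxos, index=2)
assert verify(b"utxo-2", 2, proof, root)     # spender proves inclusion
```

Item #4 repeats the same trick one level up: the per-block roots become leaves of an append-only tree, so spending requires two chained proofs.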

A nice side effect of this compact storage scheme is that in a sharded PoS blockchain, switching shards would be quick and easy. By contrast, if stakers had to download the entire transaction history, or even the full UTXO set, of a new shard, that could become a huge bottleneck.

Edit: I realize now that #3 and #4 have a big problem – there’s no way to tell whether a UTXO has been spent. I see a couple of possible fixes:

  • Just store a single “spent” bit per UTXO (sketched just below this list).
  • Have a small number of “archive” nodes which store the entire UTXO set (perhaps for one shard). They could generate Merkle proofs of deletion, showing what the updated root is when a UTXO is replaced with a nil value.
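A minimal sketch of the spent-bit fix, assuming nodes keep one packed bit per UTXO alongside each block’s Merkle root (the class and placeholder root are invented for illustration):

```python
class BlockUtxoRecord:
    """Per-block record: the Merkle root plus one 'spent' bit per UTXO."""

    def __init__(self, merkle_root: bytes, num_utxos: int):
        self.merkle_root = merkle_root
        self.spent = bytearray((num_utxos + 7) // 8)  # packed bitfield

    def is_spent(self, index: int) -> bool:
        return bool(self.spent[index // 8] & (1 << (index % 8)))

    def mark_spent(self, index: int) -> None:
        """Called after the spender's inclusion proof checks out."""
        if self.is_spent(index):
            raise ValueError("double spend")
        self.spent[index // 8] |= 1 << (index % 8)

record = BlockUtxoRecord(merkle_root=b"\x00" * 32, num_utxos=4)  # placeholder root
record.mark_spent(2)
assert record.is_spent(2) and not record.is_spent(1)
```

At 32 bytes per root plus one bit per UTXO, this stays tiny relative to storing full UTXOs, though the per-block records never shrink back down to just the roots.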

Edit 2: This paper has some interesting ideas, though they require users to sync regularly, and I don’t see how new witnesses would be derived: https://eprint.iacr.org/2018/968