Ethereum upgrades at scale
Running Ethereum infrastructure involves maintaining multiple interdependent components built by different teams and projects that all need to work together to handle the computational demands of the network. As the Ethereum protocol is constantly evolving, having a sound strategy for each component's updates is key to operating properly at scale.
Components and Dependencies
It is important to understand the different components involved and how they interact to be able to devise a sound strategy to approach updates. The Ethereum stack is a complex system of machinery, usually composed of the following components:
Consensus Nodes
Consensus nodes are responsible for ensuring that the protocol's Proof-of-Stake rules are properly followed by all participants. They produce a slot every 12 seconds in which an Ethereum block can be proposed. This layer is made up of beacon nodes which are connected to:
- Other beacon nodes all over the world
- Internal validation nodes, to which they provide payloads to sign
- Execution nodes or execution builders, which create the actual content of blocks.
Consensus nodes can handle multiple validators and validation keys under the hood. Whenever they see that one of the managed validation keys needs to propose a block, they ask an execution node to create its content and the validation node to then sign it.
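To make this flow concrete, here is a minimal sketch of how an operator can ask a beacon node which of its managed keys has an upcoming block proposal, using the standard Beacon API. The beacon URL and the list of validator indices are assumptions for illustration.

```python
# Minimal sketch, assuming a beacon node exposing the standard Beacon API on
# http://localhost:5052 and a hypothetical set of validator indices we operate.
import requests

BEACON_URL = "http://localhost:5052"             # assumption: local beacon REST endpoint
MANAGED_VALIDATOR_INDICES = {"12345", "67890"}   # hypothetical validator indices

def current_epoch() -> int:
    # Fetch the chain head and derive the epoch (32 slots per epoch on mainnet).
    head = requests.get(f"{BEACON_URL}/eth/v1/beacon/headers/head", timeout=5).json()
    slot = int(head["data"]["header"]["message"]["slot"])
    return slot // 32

def upcoming_proposals(epoch: int) -> list[dict]:
    # Standard Beacon API: proposer duties for a given epoch.
    resp = requests.get(f"{BEACON_URL}/eth/v1/validator/duties/proposer/{epoch}", timeout=5)
    resp.raise_for_status()
    return [duty for duty in resp.json()["data"]
            if duty["validator_index"] in MANAGED_VALIDATOR_INDICES]

if __name__ == "__main__":
    for duty in upcoming_proposals(current_epoch()):
        print(f"validator {duty['validator_index']} proposes at slot {duty['slot']}")
```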
There are multiple teams developing consensus nodes; the most commonly used ones are:
Validation Nodes
Validation nodes are responsible for preparing payloads coming from beacons: they ensure payloads are ready to be signed and check that each payload looks correct and has not been previously signed. They eventually sign them with validation keys and send them back to beacons for propagation into the network. Validation nodes are typically responsible for a set of validator keys; the actual signature can be delegated to an external component to improve safety and security.
Teams developing beacon nodes usually also develop a validation node that can be paired with them; the most commonly used ones mirror the beacon nodes:
The validation and consensus nodes can be interoperable (it is possible to pair a Teku validator with a Lighthouse beacon), but not always. There is ongoing work to enable all validators and beacons to communicate together.
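The "has not been previously signed" check is the heart of what a validation node does before releasing a signature. Here is a minimal sketch of that kind of pre-signing guard, assuming a hypothetical in-memory slashing-protection store (real clients persist this, typically in an EIP-3076-compatible database).

```python
# Minimal sketch of the pre-signing check a validation node performs, using a
# hypothetical in-memory slashing-protection store keyed by validator public key.
last_signed_block_slot: dict[str, int] = {}   # pubkey -> highest slot already signed

def safe_to_sign_block(pubkey: str, slot: int) -> bool:
    """Refuse to sign a block proposal at or below a slot we have already signed."""
    previous = last_signed_block_slot.get(pubkey, -1)
    if slot <= previous:
        return False                          # would be a slashable double proposal
    last_signed_block_slot[pubkey] = slot     # record before releasing the signature
    return True

# Usage: the payload is only forwarded to the signer if the check passes.
assert safe_to_sign_block("0xabc", 100) is True
assert safe_to_sign_block("0xabc", 100) is False
```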
Remote Signer
Remote signers are responsible for the actual signing of payloads with the private validation keys. This layer is optional as validation nodes are able to do it themselves. They can however offer additional security when operating at scale as we have discussed in our anti-slashing guide.
Remote signers scale horizontally and can be run in parallel as long as they are connected to the same underlying database (PostgreSQL, MySQL).
Currently, the most complete remote signing solution is Web3signer.
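When operating Web3signer instances, two simple checks help during upgrades: is the signer up, and does it still serve the expected validation keys? Here is a minimal sketch against Web3signer's documented REST API; the URL is an assumption, and the endpoint paths should be verified against the version you deploy.

```python
# Minimal sketch, assuming a Web3signer instance on http://localhost:9000.
import requests

SIGNER_URL = "http://localhost:9000"  # assumption: default Web3signer port

def signer_is_healthy() -> bool:
    # /upcheck returns 200 when the signer is serving requests.
    return requests.get(f"{SIGNER_URL}/upcheck", timeout=5).status_code == 200

def loaded_keys() -> list[str]:
    # Lists the validation public keys the signer currently serves.
    resp = requests.get(f"{SIGNER_URL}/api/v1/eth2/publicKeys", timeout=5)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print("healthy:", signer_is_healthy())
    print("keys loaded:", len(loaded_keys()))
```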
Execution Node
Execution nodes are responsible for producing the content of blocks. This layer is connected directly to beacons: it is the bridge between the consensus layer and the execution layer. It is used to propose execution blocks containing the actual Ethereum transactions, and it is also used by beacon nodes to identify which validation keys are staked via the deposit smart-contract.
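To illustrate how this layer is consulted, here is a minimal sketch querying an execution node over standard JSON-RPC: first its sync status, then recent deposit contract events, which is the data beacon nodes rely on to track newly staked keys. The RPC URL is an assumption; the deposit contract address is the mainnet one.

```python
# Minimal sketch, assuming an execution node exposing standard JSON-RPC on
# http://localhost:8545; the deposit contract address is the mainnet one.
import requests

EL_URL = "http://localhost:8545"                                   # assumption
DEPOSIT_CONTRACT = "0x00000000219ab540356cBB839Cbe05303d7705Fa"    # mainnet deposit contract

def rpc(method: str, params: list):
    resp = requests.post(EL_URL, json={"jsonrpc": "2.0", "id": 1,
                                       "method": method, "params": params}, timeout=10)
    resp.raise_for_status()
    return resp.json()["result"]

# False means the execution node is fully synced.
print("syncing:", rpc("eth_syncing", []))

# Recent deposit events over the last 1000 blocks.
head = int(rpc("eth_blockNumber", []), 16)
logs = rpc("eth_getLogs", [{"address": DEPOSIT_CONTRACT,
                            "fromBlock": hex(head - 1000), "toBlock": hex(head)}])
print(f"deposit logs over the last 1000 blocks: {len(logs)}")
```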
At the time of writing we are still pre-Shanghai fork. More operations combining the two layers are imminent, with the withdrawal of consensus-layer funds and rewards to the execution layer on the way.
Teams developing execution nodes are independent from the teams working on the consensus layer; the most used execution nodes are:
Overview and Failure Modes
Putting it all together results in the following topology:
The goal from an operator’s perspective is to ensure that active validation keys are attesting to block proposals from other participants, as well as proposing blocks when it is the operator’s turn. It is also essential that all of this is completed in a timely and effective fashion.
Assuming upgrading one of the components results in a full failure of that component, the result would be an outage looking like this:
Assuming upgrading one of the components results in a performance regression:
Legend:
This is a simplified view; there are other types of outages besides full failures and performance regressions, but it gives an idea of what to look for when a component is updated.
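As a simple illustration of the difference between the two failure modes above, here is a minimal sketch that distinguishes a full failure (the component no longer answers) from a performance regression (it answers but the head stops advancing), using standard Beacon API endpoints. The URL is an assumption.

```python
# Minimal sketch distinguishing a full failure from a performance regression,
# assuming a beacon node on http://localhost:5052 (illustrative URL).
import time
import requests

BEACON_URL = "http://localhost:5052"

def is_up() -> bool:
    # Standard Beacon API health endpoint: 200/206 means the node answers requests.
    try:
        return requests.get(f"{BEACON_URL}/eth/v1/node/health",
                            timeout=5).status_code in (200, 206)
    except requests.RequestException:
        return False  # full failure: the component is unreachable

def head_slot() -> int:
    head = requests.get(f"{BEACON_URL}/eth/v1/beacon/headers/head", timeout=5).json()
    return int(head["data"]["header"]["message"]["slot"])

if __name__ == "__main__":
    if not is_up():
        print("outage: beacon unreachable")
    else:
        before = head_slot()
        time.sleep(24)            # two slots on mainnet
        if head_slot() <= before:
            print("regression: head is not advancing as expected")
```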
Choosing the rollout frequency
About update cycles
The software life-cycle of each of these components can be roughly viewed in this way:
Minor updates are frequently available, improving performance or adding new features to the various components. Generally speaking, these updates aren't strictly required but are recommended; it is important to check their changelogs as there can be important changes, especially when relying on specific features not enabled by default. Their frequency is typically once every few weeks.
Once in a while, depending on the protocol’s evolution, a major upgrade will become available; these upgrades are compulsory if components are to keep operating through Ethereum's network upgrades. Their frequency is typically once every few months.
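Whatever cycle you choose, it helps to know exactly which versions are deployed on which stack. Here is a minimal sketch that reads the version strings of a beacon node and an execution node through their standard APIs; the URLs are illustrative assumptions.

```python
# Minimal sketch for tracking deployed component versions, assuming standard
# endpoints on a beacon node and an execution node (illustrative URLs).
import requests

def beacon_version(url: str = "http://localhost:5052") -> str:
    # Standard Beacon API: returns the client name and version string.
    return requests.get(f"{url}/eth/v1/node/version", timeout=5).json()["data"]["version"]

def execution_version(url: str = "http://localhost:8545") -> str:
    # Standard JSON-RPC: web3_clientVersion returns the execution client version.
    resp = requests.post(url, json={"jsonrpc": "2.0", "id": 1,
                                    "method": "web3_clientVersion", "params": []}, timeout=5)
    return resp.json()["result"]

if __name__ == "__main__":
    print("consensus:", beacon_version())
    print("execution:", execution_version())
```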
Balancing the risk
From a risk management perspective, every update carries an inherent risk of breaking a component. From config changes to subtle bugs that only impact your stack because you rely on a specific feature, or simply bad luck when updating while one of your validators is in a particular state, you’ll find that things may break when updating. On average, living on the edge and updating your stack for every minor update will result in smaller but more numerous outages in the long run.
However, the longer you delay updates, the more changes accumulate in the versions you eventually deploy compared to your previous setup, leading to a larger potential impact for each outage. There is thus a balance to find between update frequency and update size.
Choosing your rollout frequency
The Ethereum ecosystem has different networks where you can aggressively test all minor updates without risk, and where you can perform major upgrades a few weeks before they hit mainnet. The following networks are available for this type of testing:
- Sepolia: usually the first testnet to receive major upgrades; only selected operators onboarded by the Ethereum Foundation can validate there, but this doesn't prevent everyone from running non-validator nodes there (execution clients, beacons, …),
- Goerli: the largest testnet in terms of validator count and the closest to mainnet. Upgrades usually happen on Goerli a few weeks before the mainnet release,
- Mainnet: the actual Ethereum network.
Because the stakes on testnets aren't real, it is possible to assess the impact of updates there. This will cost you time due to the complexity of the architecture: combinations of component versions may present unforeseen bugs which you’ll be the first to observe, investigate, and report. The reward is high, as you’ll gain a deeper understanding of the ecosystem and develop expertise, which is essential to build confidence in your mainnet setup, especially whenever big outages happen. We thus recommend an aggressive upgrade approach on testnets and a conservative one on mainnet.
To give an example, at Kiln we are present on Sepolia, Goerli and mainnet, and we use the following strategies for each:
- Sepolia: We follow major upgrades as fast as we can, and from time to time apply minor upgrades previously qualified on Goerli. The rationale here is that Sepolia is not meant to be a test-bed for operators but rather for smart-contract developers and the community; we can't afford to break it. Yet as it's the first testnet to receive upgrades, we need to be ahead and report major issues before they land on Goerli,
- Goerli: We have a large number of validators here and an architecture that is symmetric to our production environment. We test minor updates selectively, depending on whether we think they are something we want on mainnet later. We are usually up to date within two weeks on all minor versions of all components; once we have qualified a combination of component versions for a week, we then consider it for mainnet,
- Mainnet: We try to be very conservative here; by the time upgrades arrive, they have usually already baked on Sepolia and Goerli for a week or two.
Enter Canaries
Once you know how to approach upgrades, there are ways to reduce the blast radius of potentially breaking changes. This is especially relevant on mainnet, where the stakes are… high. The idea is similar to the canaries used in coal mines, which miners would bring with them to detect when toxic gas was around: canaries would be the first to show symptoms, giving miners an early signal to leave the place safely before it was too late.
Ethereum Architecture with Canaries
Applied to a production environment as seen above, the goal is to reduce the initial scope of upgrades by impacting only a smaller subset of Ethereum stakes (our canary stakes), to see how they behave with the new versions before moving on with the rest of the fleet.
This can be achieved by running a special parallel stack with a smaller number of keys: upgrade that stack to new versions first, monitor how it performs, then decide whether or not to move on with the rest of the fleet.
In this setup, we can imagine for instance having a canary validator node with a single validation key, a second one with 10 keys, and a third one with 100 keys. You can roll out updates of validator nodes on the smaller key scopes first, watch for regression signals at each step, then proceed with the next validator.
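Here is a minimal sketch of a promotion rule for such a rollout: only proceed to the full fleet if every canary stack performs at least as well as the fleet baseline. The missed_attestation_rate() helper is hypothetical and would be wired to your own monitoring system.

```python
# Minimal sketch of a canary promotion rule; missed_attestation_rate() is a
# hypothetical helper backed by your own metrics system.
def missed_attestation_rate(stack: str) -> float:
    """Hypothetical: fraction of missed attestations for a stack over recent epochs."""
    raise NotImplementedError("wire this to your monitoring backend")

def promote_upgrade(canary_stacks: list[str], fleet_baseline: float,
                    tolerance: float = 0.01) -> bool:
    # Promote only if every canary stack performs at least as well as the fleet
    # baseline, within a small tolerance.
    return all(missed_attestation_rate(s) <= fleet_baseline + tolerance
               for s in canary_stacks)

# Usage, with canaries ordered by key count (1, 10, then 100 keys):
# if promote_upgrade(["canary-1", "canary-10", "canary-100"], fleet_baseline=0.002):
#     roll_out_to_fleet()
```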
This canary approach is quite simple to follow and can literally save millions when things go wrong. It is not a perfect solution though, as some bugs will only show up when running at scale or when a rare event happens, which will statistically occur more often once fully deployed.
In conclusion, there is no magical recipe for managing Ethereum upgrades at scale, but there are good practices that have proven successful. Here are the main takeaways:
- Be conscientious and check all components, as there are many dependencies
- Upgrade in cycles of a few weeks to make sure you test all your components frequently in the different testnet environments available on Ethereum
- Run a canary (a special parallel stack with a smaller number of keys) to test upgrades first.
Thanks to Sebastien Rannou for writing this article, as well as the Ethereum Foundation for their support.
About Kiln
Kiln is the leading enterprise-grade staking platform, enabling institutional customers to stake assets, and to whitelabel staking functionality into their offering. Our platform is API-first and enables fully automated validators, rewards, and commission management.