Switcheo TradeHub experienced a chain halt yesterday, 7th Feb 2021 at about 10:34:46 UTC on block 7252686. This is a post-mortem report on the incident.
On block 7131150, a change to the AMM module was introduced through the v1.12 node software upgrade in order to improve Switcheo TradeHub’s AMM quoting logic.
This logic involves an inverse computation of the Constant Value formula, to automatically quote maker orders on the order books based on the liquidity of the corresponding pools.
The root cause of the incident was an incorrect rounding direction when converting decimals to integers in the new implementation introduced in the v1.12 upgrade.
Switcheo TradeHub is structured as multiple independent modules, and this incorrect rounding violated a safety invariant in another module: the output amount expected by the AMM (automated market maker) module was slightly more than the output allowed by the LP (liquidity pool) module. This triggered an internal consistency safeguard, which halted the chain.
The Switcheo TradeHub code was designed to prefer safety over liveness. The chain therefore performed as expected, choosing to fail completely rather than allow the possibility of loss of funds.
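To illustrate how a rounding direction alone can trip such a safeguard, here is a minimal Python sketch. It assumes a simple constant-product pool with integer reserves, which is a stand-in and not TradeHub's actual quoting formula. The function names quote_output and invariant_holds are hypothetical.

```python
def quote_output(reserve_in: int, reserve_out: int, amount_in: int, round_up: bool) -> int:
    """Integer output amount for a swap against an assumed constant-product pool."""
    numerator = reserve_out * amount_in
    denominator = reserve_in + amount_in
    if round_up:
        return -(-numerator // denominator)  # ceiling division on integers
    return numerator // denominator  # floor division


def invariant_holds(reserve_in: int, reserve_out: int, amount_in: int, amount_out: int) -> bool:
    """The pool's product of reserves must never decrease after a swap."""
    before = reserve_in * reserve_out
    after = (reserve_in + amount_in) * (reserve_out - amount_out)
    return after >= before


if __name__ == "__main__":
    for round_up in (False, True):
        out = quote_output(1000, 1000, 10, round_up)
        print(round_up, out, invariant_holds(1000, 1000, 10, out))
    # prints:
    # False 9 True
    # True 10 False
```

In this toy pool, rounding the quote up by a single unit is enough to ask for more than the pool can give without shrinking its invariant, which is the kind of mismatch a cross-module safeguard would reject.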
Timeline of Events
7 Feb — 6:36PM SGT: Switcheo Staking validator monitoring service triggers downtime alerts to Switcheo core devs.
6:39PM SGT: Core dev [J] responds, noting that another validator has reported that all their sentry nodes are down.
6:44PM SGT: Core dev [J] suspects that the chain has halted.
6:46PM SGT: Core dev [H] confirms that the chain has halted, and requests additional support from the rest of the team.
6:47PM SGT: Other core devs respond, begin looking for the underlying issue, and start preparing a procedure for restoring chain liveness.
6:52PM SGT: Core dev [Y] suspects that the underlying issue is caused by a bug in the implementation of rounding of values within the AMM quoting logic, and suggests bypassing the safeguard temporarily to allow the chain to progress.
7:00PM SGT: Core dev [S] puts up a downtime notice on relevant user interfaces.
7:24PM SGT: Core dev [Y] begins copying chain data to a debug machine to simulate the issue and confirm the root cause.
7:25PM SGT: Core dev [Y] mentions that locating the rounding issue and then implementing and testing a full fix could take an unacceptable amount of time.
7:49PM SGT: Initial public notice on the issue released on Twitter.
7:54PM SGT: Core dev [J] agrees that a temporary hotfix upgrade is more viable.
7:55PM SGT: Core dev [I] proposes a recovery procedure and suggests bypassing the safeguard for deltas that are smaller than the LP swap fee, meaning that the chain can be allowed to progress if the issue cannot be exploited. The recovery plan indicates that the changes in the planned v1.13.0 upgrade should be bundled with the hotfix as an emergency upgrade, in order to avoid a potential issue where the fix could be accidentally rolled back later on. The v1.13.0 upgrade will then be cancelled once chain liveness is restored.
8:00PM SGT: Core dev [J] gives suggestions to improve the recovery plan.
8:01PM SGT: Validators with contact info are notified to be on standby to receive the emergency upgrade.
8:04PM SGT: Core dev [Y] agrees with the suggested bypass.
8:10PM SGT: Core dev [J] agrees with bundling v1.13 changes in the emergency upgrade.
8:18PM SGT: Core dev [I] completes initial implementation of the proposed bypass.
8:19PM SGT: Core dev [Y] suggests improvements to the hotfix.
8:21PM SGT: Core dev [I] completes implementation of the suggested improvements.
8:25PM SGT: Core dev [Y] confirms that the final implementation of the hotfix is viable.
8:27PM SGT: Core dev [J] prepares to release the hotfix as the v1.12.1 emergency upgrade and begins building and uploading the required binaries.
8:42PM SGT: Validators are notified that the hotfix is being prepared.
9:02PM SGT: Core dev [J] completes the release process for the emergency upgrade.
9:09PM SGT: Validators are notified of the release and briefed on the chain recovery plan.
9:21PM SGT: Validators that are not in the top 4 ranking by delegated stake begin starting their nodes with the new binaries.
9:37PM SGT: 42% of validators by voting power have updated their nodes.
10:02PM SGT: The remaining validators (in the top 5 ranking) continue waiting for more validators to complete their upgrade, so that those validators are not inadvertently slashed for downtime when the chain resumes block production.
10:12PM SGT: 59.38% of validators by voting power have updated their nodes.
10:13PM SGT: Core dev [Y] identifies the underlying issue and prepares the full bugfix.
10:19PM SGT: The remaining 3 validators discuss whether to wait longer, since a validator with a large delegated stake remains uncontactable and resuming without it could slash a large number of delegators.
10:48PM SGT: Remaining validators agree that sufficient waiting time has passed and that chain liveness needs to be restored.
10:54PM SGT: Remaining validators start their nodes, bringing >66.6% of voting power online.
11:00PM SGT: Block 7252687 is produced and chain liveness is restored.
11:07PM SGT: Public notice regarding incident resolution is issued.
11:12PM SGT: Validators are informed that the existing v1.13 upgrade is no longer necessary and the proposal can be aborted by voting “No”.
11:17PM SGT: Core dev [Y] confirms that the fix for the underlying issue is correct and will be deployed in a new v1.13.1 binary.
8 Feb — 12:56AM SGT: Remaining validators complete their upgrades in time. Only one remaining validator with a small delegated stake is jailed.
1:28AM SGT: Governance proposal #27 (v1.13 software upgrade) is rejected, avoiding the possibility of an erroneous rollback.
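The voting power figures in the timeline (42%, 59.38%, then >66.6%) explain why the last validators had to come online before blocks could resume. As a minimal sketch, assuming a BFT-style rule that requires strictly more than two thirds of total voting power, the check can be written with integer arithmetic. The function name has_quorum is hypothetical.

```python
def has_quorum(online_power: int, total_power: int) -> bool:
    """True when online voting power is strictly greater than 2/3 of the total.

    Assumption: a BFT-style threshold, consistent with the ">66.6%" figure
    in the timeline. Integer cross-multiplication avoids float rounding.
    """
    return 3 * online_power > 2 * total_power


if __name__ == "__main__":
    # Timeline percentages expressed in basis points out of 10000.
    for online in (4200, 5938, 6667):
        print(online, has_quorum(online, 10000))
```

Note that exactly two thirds is not enough under this rule, which is why the comparison is strict.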
As the underlying issue is non-critical, it will be fixed in an upcoming upgrade (v1.13.1), in which the temporary bypass will also be removed.
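The temporary bypass described in the timeline (tolerating deltas smaller than the LP swap fee) can be sketched as follows. This is an illustration under assumptions, not the code shipped in v1.12.1: the names enforce_output_safeguard and InvariantViolation are hypothetical, and all three amounts are assumed to be integers in the same token units.

```python
class InvariantViolation(Exception):
    """Raised when the expected output exceeds what the pool allows."""


def enforce_output_safeguard(expected_out: int, allowed_out: int,
                             swap_fee: int, bypass_enabled: bool = True) -> None:
    """Raise unless the AMM's expected output is acceptable to the pool."""
    delta = expected_out - allowed_out
    if delta <= 0:
        return  # the AMM expects no more than the pool allows
    if bypass_enabled and delta < swap_fee:
        # Temporary bypass: the excess is smaller than the fee charged on
        # the swap, so it cannot be extracted at a profit.
        return
    raise InvariantViolation(
        f"expected {expected_out} exceeds allowed {allowed_out} by {delta}"
    )
```

With a swap fee of 3 units, an excess of 1 unit passes while an excess of 3 units still halts, and setting bypass_enabled to False restores the strict check, mirroring the plan to remove the bypass once the rounding is fixed.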
Switcheo’s core devs have found that the erroneous piece of code was already extensively tested through unit tests and fuzz testing. However, the issue was not uncovered beforehand as it is only triggered by an extremely narrow set of values that was not within our fuzz testing input range. In order to reduce the possibility of similar issues occurring again, we will extend the range of input values that is used in all our fuzz testing suites.
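One way to widen a fuzz test's input range is to sample magnitudes rather than values, so that very small and very large amounts are both exercised. The sketch below uses only Python's standard library and an assumed constant-product pool as the system under test. The names draw_amount and find_counterexample, the digit limit, and the trial count are all hypothetical choices. In this toy the wrong rounding fails almost immediately, unlike the narrow trigger in the actual incident, so the point here is the sampling strategy.

```python
import random


def quote_output(reserve_in, reserve_out, amount_in, round_up):
    """Integer output for an assumed constant-product pool."""
    numerator = reserve_out * amount_in
    denominator = reserve_in + amount_in
    return -(-numerator // denominator) if round_up else numerator // denominator


def invariant_holds(reserve_in, reserve_out, amount_in, amount_out):
    """The product of reserves must not decrease after the swap."""
    return (reserve_in + amount_in) * (reserve_out - amount_out) >= reserve_in * reserve_out


def draw_amount(rng, max_digits):
    """Pick a digit count first, then a value, so every magnitude is covered."""
    digits = rng.randint(0, max_digits)
    return rng.randint(1, 10 ** digits)


def find_counterexample(round_up, trials=10_000, max_digits=30, seed=0):
    """Return the first (x, y, dx, dy) that breaks the invariant, else None."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        x, y, dx = (draw_amount(rng, max_digits) for _ in range(3))
        dy = quote_output(x, y, dx, round_up)
        if not invariant_holds(x, y, dx, dy):
            return (x, y, dx, dy)
    return None
```

Calling find_counterexample(round_up=False) should find nothing, since rounding down can never break this invariant, while find_counterexample(round_up=True) returns a failing input that can be replayed from the fixed seed.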
We thank the Switcheo community for their patience during this incident and are extremely grateful to the validator operators for their quick response, which allowed chain liveness to be restored extremely quickly.