Switcheo TradeHub suffered a chain halt on 8th Feb 2021 at 10:06PM SGT, which was unrelated to a separate issue found on the preceding day. This is a post-mortem report for the incident.
In preparation for allowing trading of futures contracts, oracles for the BTCUSD and ETHUSD indexes were created on 16th Dec 2020. This meant that validators began submitting votes for prices of these assets every second. However, after the V1.12.0 upgrade in early February, some validators stopped running the oracle service on their subaccounts. Because a supermajority was not present, no oracle results were resolved for an extended period of time, while the oracle votes remained in the chain state. Retaining unresolved votes is intended to give slower oracles a buffer period to join each voting round. The oracle module automatically purges votes that are beyond a certain age at the end of each block, but it only performs this purge when a new valid result is resolved.
Because votes were being cast but not resolved for a long time, a total of 9.1 million oracle votes had accumulated by the time of the incident. When the oracle service is running normally, this number should be less than 100.
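The accumulation described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the TradeHub oracle module: the `ToyOracle` class, the strict two-thirds threshold, the age limit measured in blocks, and the validator counts are all hypothetical and chosen to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOracle:
    """Hypothetical toy model of the purge behaviour, not TradeHub code."""
    total_validators: int
    max_vote_age: int                          # assumed: measured in blocks
    votes: list = field(default_factory=list)  # (validator, height) pairs

    def submit(self, validator: str, height: int) -> None:
        self.votes.append((validator, height))

    def end_block(self, height: int) -> int:
        """Run end-of-block logic and return the number of votes purged."""
        voters = {v for v, h in self.votes if h == height}
        # Assumed threshold: strictly more than 2/3 of validators voted.
        resolved = 3 * len(voters) > 2 * self.total_validators
        if not resolved:
            return 0  # the flaw: no result, so stale votes are kept
        before = len(self.votes)
        self.votes = [(v, h) for v, h in self.votes
                      if height - h <= self.max_vote_age]
        return before - len(self.votes)


def run_demo() -> tuple:
    oracle = ToyOracle(total_validators=6, max_vote_age=10)
    # 1000 blocks with only 3 of 6 validators voting: nothing resolves.
    for height in range(1, 1001):
        for name in ("a", "b", "c"):
            oracle.submit(name, height)
        oracle.end_block(height)
    backlog = len(oracle.votes)
    # Two more validators return; a result resolves and the whole
    # backlog of stale votes is purged inside a single block.
    for name in ("a", "b", "c", "d", "e"):
        oracle.submit(name, 1001)
    purged = oracle.end_block(1001)
    return backlog, purged, len(oracle.votes)
```

In this toy run the backlog grows by three votes per block while no result resolves, and the first resolved result then has to delete almost all of it at once. The same shape, at a much larger scale, is what produced the single very long block in this incident.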
On 8th Feb, the team, unaware of the accumulated data, asked validators to begin monitoring their oracle services to support the launch of futures trading. At around 10:00PM SGT, a validator rebooted their oracle, allowing votes to finally cross the required threshold and form new valid results.
This immediately triggered the deletion of all 9.1 million vote entries at the end of block 7290435. The operation took more than an hour for a high-performance node to process, mainly because the chain database structure and implementation are optimized for fast reads rather than fast deletions.
While the team initially thought that the block would eventually be produced, the operation also used too much memory, causing many validator nodes to be killed halfway through processing the block and leaving the chain unable to reach consensus.
The team decided that a hotfix upgrade immediately limiting the number of deletions per block was required, and quickly prepared the new node binary.
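The idea behind the hotfix, capping how many deletions a single block may perform, can be sketched as follows. This is an illustrative sketch only: the function names, the representation of votes as a queue of block heights, and the age and cap values are assumptions, and the actual V1.12.2 limit is not stated in this report.

```python
from collections import deque


def purge_capped(votes: deque, height: int, max_age: int,
                 max_deletions: int) -> int:
    """Delete at most max_deletions stale votes; return how many were deleted.

    votes holds the block height at which each vote was cast, oldest first.
    """
    deleted = 0
    while votes and deleted < max_deletions and height - votes[0] > max_age:
        votes.popleft()
        deleted += 1
    return deleted


def blocks_to_drain(backlog: int, max_deletions: int) -> int:
    """Count the blocks needed to clear a backlog of stale votes."""
    votes = deque([0] * backlog)   # all cast at height 0, so all are stale
    height, blocks = 1_000, 0
    while votes:
        purge_capped(votes, height, max_age=10, max_deletions=max_deletions)
        height += 1
        blocks += 1
    return blocks
```

With a cap in place, the work per block is bounded and the backlog drains over many blocks instead of one. For example, a purely hypothetical cap of 10,000 deletions per block would spread 9.1 million deletions over 910 blocks.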
The actual downtime for the chain was exacerbated because the hotfix had to be carefully coordinated so that it was performed only once all validators had acknowledged the upgrade off-chain. A further consensus issue arose after the hotfix was deployed due to the way Tendermint saved the previously valid pre-hotfix block on some validators, extending the downtime further. This was ultimately resolved by resyncing the chain on some validators.
Timeline of Events
8 Feb — 10:06PM SGT: A validator informs us that their node is not progressing. Core dev [J] responds immediately, pinging the rest of the team to investigate.
10:09PM SGT: Core dev [Y] confirms that blocks are not being produced in the expected timeframe and begins investigating the root cause.
10:21PM SGT: Core dev [Y] confirms that the nodes have not halted but are stuck processing a single block for an unexpectedly long period of time.
10:55PM SGT: Core dev [J] identifies the end block portion of the oracle module as the culprit that is causing the long block.
10:57PM SGT: Core dev [I] informs contactable validators that they should not restart nodes as the block is still being processed.
11:18PM SGT: Core dev [Y] identifies that there are 9.1m records to be deleted.
9 Feb — 12:30AM SGT: Core dev [J] reports that the Switcheo TradeHub node processes on the validator and sentry machines for Switcheo Staking have been automatically terminated due to insufficient memory.
12:48AM SGT: Core dev [I] recommends a hotfix upgrade to limit the number of oracle votes purged in each block, as achieving consensus no longer seemed likely.
1:09AM SGT: Core dev [J] suggests waiting for more validators to be available before releasing the patch and performing the hotfix upgrade.
1:14AM SGT: Core dev [I] agrees with the delay after discussing the potential downsides.
1:29AM SGT: Core dev [I] informs contactable validators that a patch will be coordinated the next morning so that all validators can first receive and acknowledge the required upgrade procedure.
11:02AM SGT: Core dev [I] asks validators to confirm that their nodes are temporarily powered down to avoid accepting the pre-hotfix block.
11:52AM SGT: Core dev [J] releases the V1.12.2 hotfix upgrade. Validators begin upgrade process.
12:19PM SGT: A validator reports that their node is producing an error regarding an invalid state hash for a proposed block.
1:33PM SGT: Enough validators patch their nodes to achieve consensus.
1:38PM SGT: Core dev [I] notes that no new block has been produced even after 4 Tendermint BFT rounds. Core devs begin investigating.
1:57PM SGT: Core dev [I] identifies that due to how the Tendermint consensus is implemented, it is possible that validators have already accepted the pre-hotfix block as a “valid block” that is saved in state, and will only use it for proposing blocks during consensus, even after the hotfix.
2:26PM SGT: Core dev [J] suggests resyncing the Switcheo Staking validator from a snapshot so that it will not have the previous valid block on disk.
2:27PM SGT: Resync process for Switcheo Staking validator begins.
5:16PM SGT: Resync process completes, but the chain still fails to achieve consensus.
5:32PM SGT: Core dev [I] identifies that >33% of validators also have the same issue, and therefore the chain will not be able to achieve consensus as these validators will match the proposed block state hash against their previously saved valid block.
5:49PM SGT: A list of validators that have saved the pre-hotfix block is produced by a community developer.
5:51PM SGT: Core dev [I] requests that validators with the pre-hotfix block resync their validator nodes from backups and shares a possible procedure for doing so. Validators begin performing the resync as soon as they can.
10:21PM SGT: Validators complete the chain resync, and block 7290435 is produced.
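The 5:32PM entry above rests on the Tendermint rule that a block is committed only when strictly more than two-thirds of voting power agrees on it. A small sketch of that arithmetic follows. The function names and the example total of 100 units of voting power are hypothetical, and real voting power is weighted by stake rather than counted per validator.

```python
def can_commit(agreeing_power: int, total_power: int) -> bool:
    """True when strictly more than 2/3 of total voting power agrees."""
    return 3 * agreeing_power > 2 * total_power


def max_tolerable_dissent(total_power: int) -> int:
    """Largest amount of dissenting power that still allows a commit."""
    dissent = 0
    while can_commit(total_power - (dissent + 1), total_power):
        dissent += 1
    return dissent
```

With 100 units of voting power, a commit needs at least 67 in agreement, so once more than 33 units are held by validators that reject the proposed block, no block can be committed until they are resynced.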
A complete fix for the underlying issue is in the upcoming v1.14.0 upgrade, where stale oracle votes will be discarded regardless of the presence of a valid result, and the number of deletions per block will be further limited. Switcheo’s core devs are also re-auditing the codebase to find and limit unbounded operations.
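To show why purging by age on every block keeps the vote store bounded even when no result resolves, here is a minimal simulation. It is a sketch under assumptions rather than the v1.14.0 implementation: the function names, the vote rate, the age limit, and the cap are all hypothetical.

```python
from collections import deque


def end_block_purge(votes: deque, height: int, max_age: int,
                    max_deletions: int) -> int:
    """Purge stale votes at every block, whether or not a result resolved."""
    deleted = 0
    while votes and deleted < max_deletions and height - votes[0] > max_age:
        votes.popleft()
        deleted += 1
    return deleted


def peak_store_size(blocks: int, votes_per_block: int, max_age: int,
                    max_deletions: int) -> int:
    """Simulate blocks in which no result ever resolves; return peak size."""
    votes = deque()   # block height of each stored vote, oldest first
    peak = 0
    for height in range(1, blocks + 1):
        votes.extend([height] * votes_per_block)
        end_block_purge(votes, height, max_age, max_deletions)
        peak = max(peak, len(votes))
    return peak
```

In this model, `peak_store_size(1000, 3, 10, 5)` stays at 33 votes no matter how many blocks pass. One design point the sketch makes visible is that the per-block cap has to be at least as large as the rate at which votes become stale; with a cap of 2 against 3 new votes per block, the store grows again, only more slowly.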
Switcheo’s developers make every effort to minimize the possibility of downtime when developing protocol modules. While incidents like these are unfortunate, it is hard to guarantee that they won’t occur again during the teething stage of the protocol. Instead, we intend to continue rolling out protocol features in a gradual but consistent and careful manner, while adapting quickly should any fault be detected. With this strategy and your support, we are confident that the protocol will eventually achieve its ultimate vision.
We are again extremely thankful to the Switcheo community for their patience during this incident as well as the Switcheo TradeHub validator operators for their persistence and support in resolving the issue.