Calamari V3.2.1 upgrade disruption post-mortem

Incident Summary

On 7/26, a community collator runner reported abnormally high CPU usage after updating to the recently released v3.2.1 manta node binary.

Upon receiving and confirming the report, Manta immediately advised the collator community to hold off on updating and to continue running v3.2.0 while we worked to locate the issue.

During our investigation we found that the upgrade to Substrate v0.9.22 included in our release introduced a significant memory leak in the client binary, related to the BEEFY consensus gadget, as reported here.

As we have confirmed that the memory leak is confined to the manta binary and not the on-chain WASM runtime, we advise the community to vote for deploying the new runtime to the chain while continuing to use the v3.2.0 manta node binary. A corresponding on-chain governance proposal is currently being voted on by the KMA Council and will be live as a Referendum for everyone to vote on by Sunday (US time).

Unfortunately, when downgrading our own nodes back to v3.2.0, we triggered another issue by failing to revert some node configuration that had been updated as part of the v3.2.1 upgrade. This resulted in some nodes failing to synchronize and respond to peers. This in turn left centralized exchanges (e.g. Kucoin) unable to interact with the Calamari network for 2 to 3 hours.

An upgrade to the most recent Substrate version, v0.9.26, which contains the fix for this memory leak, has been merged and is scheduled for the next release, v3.3.0, here.

Thank you for your patience and understanding as we worked to resolve this issue and especially for the initiative taken by our community collators in proactively reporting this issue. Although the Calamari network experienced a slowdown in block production for a few hours, it remained operational thanks to the decentralized nature of the collator program.

Key Takeaways

  • A memory leak in the node software should not have passed our QA process, and we have modified that process so that this does not happen again. Because our testing on the Dolphin testnet did not expose this bug before release, we are adding testing of Calamari on Kusama-local to our release procedure (Dolphin runs on the Rococo relay chain), so that we test as close as possible to the production system.
  • We are improving our node monitoring process to catch abnormal behavior earlier.
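
As an illustration of the kind of check that could catch a leak like this one earlier, here is a minimal sketch that flags sustained memory growth from periodic samples of a node process's resident memory. It is not our actual monitoring setup: the function names, the sample format (seconds, bytes) and the threshold are all assumptions made for the example, and how the samples are collected is left out.

```python
from statistics import mean


def growth_rate(samples):
    """Least-squares slope of (t_seconds, rss_bytes) samples, in bytes per second.

    `samples` is a hypothetical list of (timestamp, resident memory) pairs.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    ts = [t for t, _ in samples]
    ys = [y for _, y in samples]
    t_bar, y_bar = mean(ts), mean(ys)
    denom = sum((t - t_bar) ** 2 for t in ts)
    if denom == 0:
        raise ValueError("timestamps must not all be equal")
    return sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys)) / denom


def looks_like_leak(samples, max_bytes_per_hour):
    """True if the fitted growth trend exceeds the (assumed) hourly threshold."""
    return growth_rate(samples) * 3600 > max_bytes_per_hour
```

Fitting a trend over a window, rather than alerting on a single high reading, is meant to separate steady growth from normal short-lived spikes. A sensible window length and threshold would have to be tuned against real node behavior.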