The 17-hour outage has exposed a critical design flaw and grossly exaggerated tps claims.
By Douglas Horn, Chief Architect, Telos Blockchain
On September 14 at 12:38 PM UTC, Solana announced that it had encountered a denial-of-service disruption caused by a surge in transaction load, reaching as high as 400,000 attempted transactions per second (TPS), that overwhelmed its network and led to its longest downtime yet. While initial tweets from the official Solana Support Twitter account said only that its mainnet-beta network was “experiencing intermittent instability,” Solana Labs CEO Anatoly Yakovenko eventually attributed the network’s failure to the overwhelming transaction volume from bots during an initial decentralized exchange offering, or IDO, for the Grape Protocol project, which was taking place on the Solana-based decentralized exchange Raydium. Although the flurry of transactions caused problems for Solana’s transaction queues, a deeper analysis is needed to understand the maladies plaguing Solana and why it will take more than “bug fixing,” as Yakovenko put it, to correct the issue.
Frothing up the TPS numbers
Solana’s claims of superior scalability compared to Ethereum seem misleading when you consider that roughly 80% of Solana’s transactions are the chain's own consensus messages, required simply to coordinate validators. Such messages are typically handled separately from on-chain transactions via a distinct communications channel but are inexplicably bundled with user transactions on Solana’s blockchain.
Examining the transaction numbers exposes this illusion. The much-ballyhooed figure of 400,000 tps is simply the number of transactions that bots attempted. As for the transactions actually written to the chain, that was approximately 2,000 tps, of which up to 350 tps were actual user transactions and the rest were validator consensus messages. Anatoly Yakovenko confirmed this in his response to a Twitter question about the actual maximum tps prior to the outage:
"I don't think it was significantly higher then (sic) normal."
— anat◎ly 🦀🤿🏒 (@aeyakovenko) September 15, 2021
Solana’s normal speed is about 2,000 tps, of which roughly 80% are consensus messages rather than the user or smart contract transactions that networks like Ethereum process at about 15 TPS, or that Telos can process at up to 10,000 TPS. Further, the massive number of consensus messages required to vote on each new fork means that as more Solana validators join the cluster, the number of consensus messages grows quadratically rather than linearly, as described by “the handshake problem.” This protocol design, which artificially inflates the TPS numbers, is a large part of why the Solana chain was halted (0 tps) for nearly 17 hours before validators on the network could help restart it. A chain's actual throughput is the point at which it loses consensus and crashes or must be stopped. For Solana, that real number is around 2,000 tps, but only 350 of those are user transactions that would be counted on any other blockchain.
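To put rough numbers on both points, here is a minimal Python sketch. It assumes, as the handshake problem does, that every validator exchanges messages with every other validator, which is a simplification for illustration and not a measurement of Solana's actual vote traffic. The function names `pairwise_links` and `user_tps` are hypothetical helpers invented for this example.

```python
def pairwise_links(validators: int) -> int:
    """Handshake problem: number of distinct pairs among n participants, n(n-1)/2."""
    return validators * (validators - 1) // 2


def user_tps(total_tps: int, consensus_percent: int) -> int:
    """Transactions per second left for users once consensus messages are subtracted."""
    return total_tps * (100 - consensus_percent) // 100


if __name__ == "__main__":
    # Ten times more validators means roughly a hundred times more pairs.
    for n in (10, 100, 1000):
        print(f"{n} validators -> {pairwise_links(n)} pairwise links")

    # Figures quoted in this article: about 2,000 tps total, roughly 80% consensus.
    print(f"user tps at 80% consensus share: {user_tps(2000, 80)}")
```

An 80% consensus share of 2,000 tps leaves about 400 tps for users, in the same range as the figure of up to 350 cited above.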
Problems with prioritizing a single channel
In addition to the concern of faster-than-linear growth in consensus messaging as the network grows, Solana lacks any prioritization rules that would ensure critical consensus messages are still processed when the transaction queue is flooded. Instead, at such times, consensus messages are displaced, leading to a loss of sync between validator nodes and likely a growing number of consensus rollbacks that further congest the network. Based on the tweets from Solana’s official account about the outage, it seems likely that they will pursue prioritizing network-critical consensus messaging over user transactions rather than creating a separate blockchain consensus channel. The latter is the sounder approach: a distinct communication channel between validators cannot be interrupted by user transactions and doesn’t require prioritizing messages, which risks introducing centralization of BFT. It would prevent a recurrence, unlike the “quick-fix” approach that seems intended.
Embrace real solutions and real numbers
Solana's vulnerabilities can be resolved by emulating the tried-and-true way that blockchains like Ethereum, Telos and most others handle consensus messaging: via a communication channel distinct from user transactions, though at the expense of frothy theoretical max tps numbers for the marketing department. For Telos, this proven design has been running with 100% uptime for almost 3 years and over 173 million blocks, despite occasional attempts to overwhelm the network. Telos’ claim of 10,000 tps on the native chain reflects a tested maximum, not a theoretical one, backed by tests on the EOSIO Jungle Testnet performed by block producers running Telos and other EOSIO chains. This sets it apart from competitors like Solana that evidently have a lot of catching up to do.
The race to post ever-higher theoretical maximum tps numbers to wow investors has put the cart before the horse in Solana's design, but Solana is hardly the only chain guilty of advertising untested and unachievable tps numbers. We in the crypto community bear some responsibility here for blindly FOMOing over such numbers without ever asking to see test results. Let's stop this and get real about tps numbers so blockchain designs won't need to compromise reliability trying to exaggerate stats.
About the author:
Douglas Horn is the Whitepaper Author and Chief Architect of the Telos blockchain. He is also the Founder and CEO of GoodBlock, a dApp development company that creates cutting-edge dApps, tools and games. Prior to blockchain, Douglas worked in the entertainment and gaming industries.