Node performance variance


#1

An interesting question from Telegram: According to the documentation, nodes are compensated by the number of rounds contributed and the computations performed. Is it safe to assume that faster nodes (faster processors, servers) could end up completing more rounds and therefore be compensated more than slower nodes? For example, would a server node be able to complete more rounds than a laptop node, and therefore be compensated more?

This is a delicate topic that touches one aspect of the Performance <-> Decentralization trade-off. To explain why, it's important to first explain synchrony. Distributed protocols can work in one of two models: synchronous and asynchronous. Synchronous protocols proceed in rounds, where each round has a timeout. Blockchains are generally synchronous protocols: each block is considered one round/unit of time, and all messages in that block are considered to have occurred at the same time. This means that if two messages depend on each other, they have to occur in different rounds, which leads to latency, which is one reason why blockchains are slow.
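To make the latency point concrete, here is a minimal sketch (the names and the block time are purely illustrative assumptions, not any particular chain's parameters) showing that a chain of messages, each depending on the previous one, needs one block per message, so the user's waiting time grows with the dependency depth:

```python
# Hypothetical illustration: dependent messages in a synchronous, block-based model.
# Each block is one round; a message that depends on an earlier one must land in a
# later block, so end-to-end latency grows linearly with the dependency depth.

BLOCK_TIME_SECONDS = 12  # assumed block interval, purely illustrative


def rounds_needed(dependency_chain_length: int) -> int:
    """Each dependent message must wait for the previous message's block."""
    return dependency_chain_length


def total_latency_seconds(dependency_chain_length: int) -> int:
    return rounds_needed(dependency_chain_length) * BLOCK_TIME_SECONDS


if __name__ == "__main__":
    # Three messages, each depending on the previous one: they cannot share
    # a block, so the user waits roughly three block times.
    print(total_latency_seconds(3))  # 36 seconds with a 12-second block time
```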

However, in an asynchronous system, which is closer to how the internet naturally works, every message simply arrives as fast as it can. There's no reason to wait for one round to finish before building on the results of that round. Unfortunately, there is a seminal impossibility result (the FLP result) showing that a fully asynchronous system with even one fault cannot be guaranteed to reach consensus on a result (there are ways to somewhat get around this with additional assumptions). This is an intuitive result: imagine there are no timeouts/synchrony points and we are simply waiting indefinitely for a message/computation result from a node. That message could arrive very quickly (in the optimistic case), but it could also never arrive if the node crashed or is simply being malicious.

With that background in mind, it should be clear that we need to set some timeout bound on how long users of the system wait before a round is completed. In other words, imagine a user requests a computation from the network, and a node is selected at random to perform it. There should be some time limit on how long we wait for a result (and if we don't receive a result by the end of that time, we should penalize that node). But how do you set that time limit? If you optimize it for really fast nodes (servers as opposed to laptops), you limit the type of nodes that can participate, which leads to centralization; in return you can get better performance, because only really strong servers can participate and they have the urgency to complete computations fast or be penalized. If you set the limit too high, then nodes can delay computations without any consequences for up to the time limit T.
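As a toy illustration of this trade-off, here is a minimal sketch. The reward/penalty amounts, node speeds, and the `Node`/`run_round` names are assumptions made for the example, not the network's actual parameters or API:

```python
# Toy model of a timeout-bound round: the network waits at most T seconds for a
# node's result, rewards a result that arrives in time, and penalizes a node
# that misses the deadline. All numbers are illustrative.

from dataclasses import dataclass


@dataclass
class Node:
    name: str
    compute_seconds: float  # how long this node needs to finish the job
    stake: float = 100.0


def run_round(node: Node, timeout_t: float, reward: float = 1.0, penalty: float = 5.0) -> None:
    if node.compute_seconds <= timeout_t:
        node.stake += reward   # result arrived within the time bound
    else:
        node.stake -= penalty  # no result by the deadline, so the node is penalized


server = Node("server", compute_seconds=0.5)
laptop = Node("laptop", compute_seconds=3.0)

# A tight timeout only the server can meet pushes the laptop out (centralizing);
# a generous timeout lets both participate, but tolerates delays of up to T.
for timeout_t in (1.0, 10.0):
    for node in (server, laptop):
        run_round(node, timeout_t)
    print(f"T={timeout_t}: server stake={server.stake}, laptop stake={laptop.stake}")
```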

The way we’re thinking of resolving this is to make the time bound T large enough to allow nodes to use slower hardware and recover from accidental failures, while at the same time allowing computations to be batched. This gives nodes the incentive to execute as many computations as they can within a single round (as in the optimistic asynchronous case, which is optimal), since they would earn more rewards that way. This means that running faster hardware would be useful, assuming there's enough activity in the network, but it doesn't prohibit commodity hardware like laptops from participating.
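Here is a minimal sketch of that batching incentive, again with assumed, purely illustrative numbers (round length, per-job times, per-job reward) rather than the actual reward schedule:

```python
# Toy model of batching within one round of fixed length T: a node's reward is
# proportional to how many computations it completes before the round ends.

def reward_for_round(jobs_available: int, seconds_per_job: float,
                     round_length_t: float, reward_per_job: float = 1.0) -> float:
    """Reward equals the number of jobs the node can finish within one round."""
    jobs_completed = min(jobs_available, int(round_length_t // seconds_per_job))
    return jobs_completed * reward_per_job


ROUND_LENGTH_T = 60.0  # a generous bound so commodity hardware can still keep up

# With plenty of network activity, faster hardware batches more jobs per round...
print(reward_for_round(jobs_available=100, seconds_per_job=0.5, round_length_t=ROUND_LENGTH_T))  # 100.0
print(reward_for_round(jobs_available=100, seconds_per_job=5.0, round_length_t=ROUND_LENGTH_T))  # 12.0

# ...but with little activity, both earn about the same, so slower nodes are not priced out.
print(reward_for_round(jobs_available=3, seconds_per_job=0.5, round_length_t=ROUND_LENGTH_T))    # 3.0
print(reward_for_round(jobs_available=3, seconds_per_job=5.0, round_length_t=ROUND_LENGTH_T))    # 3.0
```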


#2

Thanks a lot for that detailed explanation. Much appreciated.


Node going offline without a penalty
Multiple Questions