Gossip protocol (a.k.a Epidemic protocol) is a peer-to-peer communication protocol allowing state sharing in distributed systems. It is a decentralized, scalable, and fault-tolerant protocol that can be used to disseminate information to a large number of nodes. In this blog post, we will discuss the basics of gossip protocol, its advantages and disadvantages, and some use cases for gossip protocol.
Multicast problem
One of the common problems in a distributed system is the Multicast problem, where we want to disseminate information from 1 node to a group of other nodes.
The multicasting problem has a few requirements and constraints:
- Fault tolerance: Nodes may crash, the network may be imperfect, and information packets may be dropped or delayed. Despite this, all healthy nodes should receive the information.
- Scalability: Clusters may contain thousands of nodes. The time it takes to disseminate information to all nodes, known as the latency, should be as low as possible.
Centralized
One of the most straightforward approaches is a centralized sender. This server has a list of receivers and simply goes through the list, sending the message to each one. However, this approach has many limitations:
- It is not fault-tolerant: not all recipients get the message if the sender node fails halfway.
- High latency: The latency grows linearly as the number of nodes grows (O(n)). Additionally, the sender node has a substantial overhead of sending thousands of messages.
Tree-based
Tree-based approaches develop a spanning tree of the nodes in the group. Each node in the tree receives the message from its parent and forwards it to its children. If the tree is balanced and there are N nodes, the latency is $O(logN)$. Additionally, if each node has a constant number of children, each node's overhead is constant. However, if a node near the root fails, it will affect a large number of nodes.
Gossip
Gossip protocol was developed to address the limitations of centralized and tree-based protocols. In Gossip protocol, the sender periodically (e.g., once every 5 seconds) picks b target node at random and sends them the message. Once a node receives the message (a.k.a being "infected"), it becomes the sender and performs the same process of periodically picking b random target and disseminating the message.
Because target nodes are picked at random, there might be nodes that receive the message multiple times.
This protocol works similarly to the way an epidemic spreads. After a few rounds, all the nodes in the group are "infected."

Gossip protocol analysis
In this section, we analyze the latency and reliability of Gossip protocol.
Let's set up the model first:
There are $(n+1)$ nodes in the system. Each infected node randomly picks b target node and disseminates the message.
- Let B be the probability that an infected node picks a particular uninfected node as a gossip target. B = b/n.
- Let x be the number of uninfected nodes, and y be the number of infected nodes. Initially: $x = n, x = 1$. And $x + y = n + 1$ at all times.
Because this is a continuous time process, we can write this a differential equation of the rate of uninfected nodes over time (time is measured as the number of rounds since the gossip protocol started).
$$\frac{dx}{dt} = -Bxy $$
$xy$ is total number of potential infected/uninfected contacts per time unit. Among all those potential contacts, only a fraction of $B$ happen. We have a minus sign because the number of uninfected nodes decreases over time.
$$\frac{dx}{xy} = -Bdt$$
$$\frac{dx}{x(n+1)} + \frac{dx}{y(n+1)} = -Bdt$$
$$\frac{dx}{x(n+1)} + \frac{dx}{(n+1-x)(n+1)} = -Bdt$$
$$\int_0^t\frac{dx}{x(n+1)} + \int_0^t\frac{dx}{(n+1-x)(n+1)} = \int_0^t-Bdt $$
$$\frac{ln(X_t)-ln(X_0)}{n+1} + \frac{ln(n+1-X_t) - ln(n+1-X_0)}{n+1} = -Bt$$
$$ln(X_t) - ln(X_0) - ln(n+1-X_t) + ln(n+1-X_0) = -B(n+1)t$$
substitute $X_0 = n$ (when Gossip protocol starts, only 1 node is infected):
$$ln(X_t) - ln(n) - ln(n+1-X_t) + ln(1) = -B(n+1)t$$
$$ln(\frac{X_t}{n(n+1-X_t)}) = -B(n+1)t$$
$$\frac{X_t}{n(n+1-X_t)} = e^{-B(n+1)t}$$
$$\frac{X_t}{n(n+1-X_t)} = \frac{1}{e^{B(n+1)t}}$$
$$X_te^{B(n+1)t} = n(n+1) - nX_t$$
$$X_t = \frac{n(n+1)}{n+e^{B(n+1)t}}$$
$$Y_t = n + 1 - X_t = \frac{n+1}{1+ne^{-B(n+1)t}} = \frac{n+1}{1+ne^{-b/n(n+1)t}}$$
To simplify this equation, we choose $t = clog(n)$
$$Y_t = \frac{n+1}{1+ne^{-b/n(n+1)clog(n)}} \approx \frac{n+1}{1+\frac{1}{n^{cb-1}}} = (n+1)\frac{n^{cb-1}}{n^{cb-1} + 1} \approx (n+1)(1- \frac{1}{n^{cb-1}}) \approx (n+1) - \frac{1}{n^{cb-2}}$$
Okay, let's translate to human language:
- $t = clog(n)$ means we have $clog(n)$ rounds of gossip so far
- number of infected node at time $t = clog(n)$ is $(n+1) - \frac{1}{n^{cb-2}}$
If I choose a small number of c and b, (for example, c = 2, b = 2) $\frac{1}{n^{cb-2}}$ becomes very close to zero. That means after 2log(n) rounds of gossiping, most of the nodes in the system are infected.
In summary, if we set c, b to small numbers independent of n, we can achieve:
- low latency: only $O(clog(n))$ rounds of gossip.
- reliability: all but $\frac{1}{n^{cb-2}}$ nodes receive the message.
- high performance: each node has transmitted no more than $cblog(n)$ messages.
- fault tolerance:
- packet loss: let's say the network has 50% rate of packet loss: $b$ becomes $b/2$.
- node failure: let's say 50% of nodes fail: $n$ becomes $n/2$ and $b$ becomes $b/2$
We can achieve the same reliability as 0% packet loss with twice as many rounds $2clog(n)$. - because target nodes are picked randomly, the probability that the b first nodes all fail is low.
Applications
Gossip protocol is commonly used in NoSQL distributed databases (Cassandra, DynamoDB, ...) to share information about the cluster’s state and nodes.