One of the most common challenges we face as software engineers is ensuring that only one component in a distributed application performs a certain computation at a time.
For example, let’s say we are running three nodes in our application and we need to run a scheduled job daily. How can we make sure that only one of the nodes triggers the job? If the job sends an email to our customers and all three nodes trigger it, our customers could receive the same email three times! We don’t want that, so what can we actually do?
Some people might say: “Let’s run only one node! Easy!”
Well, not that easy. In most cases we must ensure an adequate level of availability for our service; running just one node would mean that our service becomes unavailable whenever that node has a problem.
What we actually need is a way to select some sort of “master node” responsible for this task. Another important aspect to consider is that if our master node fails, this responsibility has to be delegated to one of the secondary nodes immediately to avoid disruptions.
Let’s take a look at what we want to achieve visually to see it clearly.
What we need is a simple way to “elect” one leader that will be responsible for the task, while the rest of the nodes patiently wait for their turn in case their help is needed. These nodes would be in a so-called “dormant” state; they’ll only be awakened if the leader node is lost or becomes unresponsive.
How Could We Solve This Problem?
In some cases, people decide to build a fairly complex mechanism to make sure that only one of the nodes performs the task.
Some database engines nowadays support an atomic “compare-and-set” operation, which could be a reasonable way to solve this issue quickly: we take advantage of a feature of our database to solve the challenge without reimplementing it ourselves. But what if our database doesn’t support such atomic operations?
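To see why compare-and-set solves the race, here is a minimal in-process sketch using Java’s `AtomicReference`; the class and node names are illustrative, and a real implementation would run the equivalent operation against the database itself.

```java
import java.util.concurrent.atomic.AtomicReference;

// In-process analogue of an atomic compare-and-set lock.
public class CasLock {
    private static final String FREE = "FREE";
    private final AtomicReference<String> owner = new AtomicReference<>(FREE);

    // Atomically take the lock: succeeds only if it is still free,
    // so even if two nodes race, exactly one caller can win.
    public boolean tryAcquire(String nodeId) {
        return owner.compareAndSet(FREE, nodeId);
    }

    public String currentOwner() {
        return owner.get();
    }

    public static void main(String[] args) {
        CasLock lock = new CasLock();
        System.out.println(lock.tryAcquire("node-1")); // prints "true": node-1 wins
        System.out.println(lock.tryAcquire("node-2")); // prints "false": already taken
    }
}
```

The key point is that the read (“is it free?”) and the write (“it’s mine now”) happen as a single atomic step; without that, the two steps can interleave between nodes.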
Things get more complicated, because every node would compete to acquire the lock: two nodes could read the lock’s value as “free” at the same time and both set it successfully, without noticing that the other node has also “taken the lock”. This would mean that not one but two nodes would be performing the task; in our example, sending the email to the customer twice.
However, even if our database supports compare-and-set operations, we still need some mechanism to make sure that if our master node dies another node takes over; something like a heartbeat process that constantly checks the status and acts accordingly when things fail. This takes time to implement ourselves; ideally we’d want to take advantage of what others have already done and tested thoroughly.
And this is precisely what leads us to solve this challenge using something like Apache Pulsar, if we are already using it in our system. A similar solution could probably be achieved with Kafka, but we’ll be focusing on Apache Pulsar in this article.
Let’s see what it would look like!
Using Apache Pulsar
How can we take advantage of having Apache Pulsar available in our system then? Pulsar provides a type of subscription called failover, which basically implements a leader election mechanism for us.
So how can we make use of it to guarantee that our scheduled jobs are run only once?
Without going into too much implementation detail, as this would highly depend on your particular use case, we can think of a simple way to do it. Let’s take a quick look!
Auto-Start Schedulers Based on Heartbeat Events
One way to achieve what we need is to have each node start a consumer that listens for heartbeat events and, right after that, begins sending heartbeat events itself. These consumers subscribe to our topic using a “failover” subscription, therefore only one of them will be able to start its scheduler. If our master node dies, one of our secondaries will take over and start the jobs immediately.
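The wiring could look roughly like the sketch below, using the Pulsar Java client. This is only a sketch, not a production implementation: it assumes the `pulsar-client` dependency on the classpath, a broker reachable at `pulsar://localhost:6650`, and the topic and subscription names are hypothetical.

```java
import org.apache.pulsar.client.api.*;
import java.util.concurrent.*;

public class LeaderElectionSketch {
    public static void main(String[] args) throws PulsarClientException {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // hypothetical broker URL
                .build();

        // Failover subscription: Pulsar delivers messages to a single
        // consumer (the primary) while the others stand by.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("scheduler-heartbeats")
                .subscriptionName("scheduler-leader-election")
                .subscriptionType(SubscriptionType.Failover)
                .subscribe();

        // Every node also publishes a heartbeat periodically, regardless
        // of whether it is currently the primary.
        Producer<byte[]> producer = client.newProducer()
                .topic("scheduler-heartbeats")
                .create();
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(() -> {
            try {
                producer.send("heartbeat".getBytes());
            } catch (PulsarClientException e) {
                e.printStackTrace();
            }
        }, 0, 10, TimeUnit.SECONDS);

        // Only the primary receives the heartbeats; it would start the job
        // scheduler on the first one and simply acknowledge the rest.
        while (true) {
            Message<byte[]> msg = consumer.receive();
            // start the scheduler here if it is not running yet
            consumer.acknowledge(msg);
        }
    }
}
```

Note that publishing and consuming happen on the same topic: every node writes heartbeats, but the failover subscription guarantees only one node reads them at any given time.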
That’s basically the idea, let’s see this with a diagram to understand it better!
In this example we have a topic to manage the distributed lock: every consumer periodically sends heartbeats to this topic and at the same time subscribes to it using a failover subscription. Only one of them will become the primary and process heartbeat events. If the primary node hasn’t started the email scheduler yet, it will start it as soon as it receives the first heartbeat; subsequent heartbeats will be ignored for as long as the scheduler is running.
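The primary’s heartbeat handling can be sketched as a tiny state machine (the class and method names are illustrative); the receive loop of the failover consumer would simply call `onHeartbeat` for every message it gets while it is the active consumer.

```java
// Decides, on each heartbeat, whether the scheduler needs starting.
public class HeartbeatScheduler {
    private boolean schedulerRunning = false;

    // Returns true only when this heartbeat actually started the scheduler.
    public boolean onHeartbeat() {
        if (schedulerRunning) {
            return false; // scheduler already running: ignore this heartbeat
        }
        schedulerRunning = true;
        startEmailScheduler();
        return true;
    }

    public boolean isRunning() {
        return schedulerRunning;
    }

    private void startEmailScheduler() {
        // Placeholder for the real job scheduling,
        // e.g. a ScheduledExecutorService or Quartz trigger.
    }
}
```

Because only the primary consumer ever receives heartbeats, this state never needs to be coordinated across nodes; each node just tracks whether it has started its own scheduler.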
As simple as that! So what happens now if our primary dies? Let’s see what that would look like!
In the scenario where our primary node dies, Pulsar’s failover subscription will detect that the node is gone and the next node in priority will take over. In our diagram, the secondary node on the left receives a heartbeat, which triggers the start of the job. Once the former primary recovers, it will patiently wait for its turn as a secondary node.
And that’s basically it! We really hope you’ve found this example useful!
I wouldn’t recommend starting to use a new technology just to implement a distributed lock, but often we have to be clever and take advantage of what we already have available in our system to save time and unnecessary complexity.
We could of course build our own distributed lock, but that takes time and is an error-prone task; by using Pulsar we are relying on a feature that has been thoroughly tested by other engineers, saving us precious time and possible problems.
That’s all from us this time, we really hope you’ve enjoyed this article and we’re looking forward to seeing you again soon!
If you like our articles please subscribe; you can also submit a donation if you want to support TheBoredDev.com to continue online. We’d really appreciate it!