Recently I’ve been looking at how to ensure services are always running globally across a number of data centres.
Why would you want to using Leadership elections to do this?
I want the service to remain available during outages which affecting a particular node or DC.
To do this you have two options, active-active services, where all DC’s can serve all traffic, or active-passive, where one DC or Node is a master and the others are secondary’s. There is also a halfway house but we’ll leave that for now.
The active-active example is better for some scenarios but comes at a cost, data replication, race conditions and synchronisation require careful thought and management.
For a lot of services the idea of a master node can greatly simplify matters. If you have a cluster of 3 machines spread across 3 DCs and they consistently elect a single, healthy, master node which orchestrates requests – things can get nice and simple. You always have 1 node running as the master and it moves around as needed, in response it issues.
The code it runs can be written in a much simpler fashion (generally speaking). You will only ever have one node executing it, as the elected master, so concurrency concerns are removed. The master can be used to orchestrate what the secondaries are doing or simple process requests itself and only use a secondary if it fails. Developers can write, and more importantly test, in a more managable way.
Now how does this affect scaling a service? Well you can now partition to scale this approach. When your single master is getting top hot you can now split the load across two clusters. Each responsible for a partition, say split by user id. But we’ll leave this for another day.
So how do we do this in .Net?
Well we need a consensus system to handle the election. I chose to use an etcd cluster deployed in multiple DCs, others to consider are Consul and Zookeeper. Lets get into the code..