Translated into slightly-less-technical terms:
Our database already runs on over 150 separate computers with mechanisms to keep the data on all 150 synchronized and consistent. How do you preserve this multi-computer synchronization while also adding in the ability for the entire system to self-repair in the event of an emergency (like a hardware failure)?
Buckle up, this is going to be a long article - it took me a solid month to figure out the right process for making this work, hopefully it saves someone else effort in the future. Before I jump into the full setup details, some background on what I'm trying to do and why. I'm going to assume that if you're reading this, you have some fluency in SQL Server Management Concepts; if not, this is going to be a dense maze of technical jargon. This article describes an advanced technique, not something I'd recommend for SQL Server beginners.
Glossary of Server Names in our Stack:
- Primary server: "Source of Truth" used for writing data in the cloud. As of the time I write this, this server has between 1 and 10 active financial transactions per second (approximately $100 Million per year in incoming Mobile Money traffic).
- Replication: Mechanism that copies data from the Primary server to other places
- Distributor server: Does the main work of Replication (copying data around, networking, etc.)
- Merge Replication Laptop: Mostly-offline tiny slice of our database (see the technical intro for more detail)
- Transactional Replication Server: Online Read-Only copy of our database
- Gateway Server: Provides a public gateway that lets merge-replicated laptops talk to our main server
- AlwaysOn Failover Cluster: Keeps a backup "Source of Truth" on standby so that when (not if) the Primary server has a hardware problem or other unexpected outage, there's a backup ready to step in and take over.
- Active Directory: Windows Toolset to handle a large number of servers/users across a domain. Normally it would be someone's full-time job to handle this.
Before June 2020, this is how our server setup looked (simplified):
In March 2020, we started seeing our physical servers dropping like flies (they were all configured at roughly the same time and had roughly similar specifications; one by one they started showing hardware failures). That $100 Million server had regular (24-hour backups) and log-shipping, but considering that every second of downtime costs the organization $3 something more robust was needed. Between March 2020 and June 2020, I tested out and implemented the process that got our production stack to look like this instead:
Normally this kind of high-value financial system upgrade would be done by a team of dedicated database engineers, consultants and risk managers. We don't have those resources, so this project was done by:
- Two people (a single Sysadmin and a single DBA) - we decided early on that we didn't have the time to maintain active directory, so made use of workgroup failover clustering instead.
- Non-enterprise SQL license (really stretching the limits of what we can do without paying for an upgrade)
- Hetzner server host (like AWS/Azure but substantially cheaper and with less-advanced cloud tools)
This is a really complex process that frequently had me working weekends/cursing out obscure SQL Server/networking issues. I'll try to cover both the process that ended up working and some of the debugging steps/pitfalls I ran into along the way. So, without further ado, here's how you setup a SQL configuration with:
- Distributor separate from Primary server
- Merge Replication
- Transactional Replication
- AlwaysOn Failover Clustering in a Workgroup Configuration (No Active Directory)
- Server 2019
- SQL 2016
Step 1: Setup Windows Failover Clustering
We use hetzner server hosting for this - we used a paired physical server and virtual server cluster (see my notes at the end about the advantages/disadvantages of doing so) along with two virtual machines for a Distributor and Test Transactional Subscriber for a total of 4 servers (3 virtual, 1 physical). Once you have the Windows OS (Server 2019) configured, you'll want to:
1. Run Sysprep on both intended Primary servers to make sure that networking components are all working - our sysadmin sometimes copy/pastes server images, if you do this then the Sysprep command is needed for networking to configure properly.
2. Use the windows Server Manager to install failover clustering (lots of
blog posts cover the frontends that show how to do this)
3. Setup a total of three network interfaces. We're probably horribly abusing this system by just using VPNs, but with the project constraints it was far simpler to have two VPNs than to try something more complex (our attempts with VLANs never quite worked properly). We use OpenVPN (two distinct VPNs).
4. In theory you can now go ahead and use the Failover Clustering UIs to setup failover clustering and it just works (it seems so easy on all the other blog posts...). In reality, there are still a few error messages you're going to hit unless you make very specific targeted changes. One error that had me stuck for a week was this: