Translated into slightly-less-technical terms:
Our database already runs on over 150 separate computers with mechanisms to keep the data on all 150 synchronized and consistent. How do you preserve this multi-computer synchronization while also adding in the ability for the entire system to self-repair in the event of an emergency (like a hardware failure)?
Buckle up, this is going to be a long article - it took me a solid month to figure out the right process for making this work, hopefully it saves someone else effort in the future. Before I jump into the full setup details, some background on what I'm trying to do and why. I'm going to assume that if you're reading this, you have some fluency in SQL Server Management Concepts; if not, this is going to be a dense maze of technical jargon. This article describes an advanced technique, not something I'd recommend for SQL Server beginners.
Glossary of Server Names in our Stack:
- Primary server: "Source of Truth" used for writing data in the cloud. As of the time I write this, this server has between 1 and 10 active financial transactions per second (approximately $100 Million per year in incoming Mobile Money traffic).
- Replication: Mechanism that copies data from the Primary server to other places
- Distributor server: Does the main work of Replication (copying data around, networking, etc.)
- Merge Replication Laptop: Mostly-offline tiny slice of our database (see the technical intro for more detail)
- Transactional Replication Server: Online Read-Only copy of our database
- Gateway Server: Provides a public gateway that lets merge-replicated laptops talk to our main server
- AlwaysOn Failover Cluster: Keeps a backup "Source of Truth" on standby so that when (not if) the Primary server has a hardware problem or other unexpected outage, there's a backup ready to step in and take over.
- Active Directory: Windows Toolset to handle a large number of servers/users across a domain. Normally it would be someone's full-time job to handle this.
Before June 2020, this is how our server setup looked (simplified):
In March 2020, we started seeing our physical servers dropping like flies (they were all configured at roughly the same time and had roughly similar specifications; one by one they started showing hardware failures). That $100 Million server had regular (24-hour backups) and log-shipping, but considering that every second of downtime costs the organization $3 something more robust was needed. Between March 2020 and June 2020, I tested out and implemented the process that got our production stack to look like this instead:
Normally this kind of high-value financial system upgrade would be done by a team of dedicated database engineers, consultants and risk managers. We don't have those resources, so this project was done by:
- Two people (a single Sysadmin and a single DBA) - we decided early on that we didn't have the time to maintain active directory, so made use of workgroup failover clustering instead.
- Non-enterprise SQL license (really stretching the limits of what we can do without paying for an upgrade)
- Hetzner server host (like AWS/Azure but substantially cheaper and with less-advanced cloud tools)
This is a really complex process that frequently had me working weekends/cursing out obscure SQL Server/networking issues. I'll try to cover both the process that ended up working and some of the debugging steps/pitfalls I ran into along the way. So, without further ado, here's how you setup a SQL configuration with:
- Distributor separate from Primary server
- Merge Replication
- Transactional Replication
- AlwaysOn Failover Clustering in a Workgroup Configuration (No Active Directory)
- Server 2019
- SQL 2016
Step 1: Setup Windows Failover Clustering
We use hetzner server hosting for this - we used a paired physical server and virtual server cluster (see my notes at the end about the advantages/disadvantages of doing so) along with two virtual machines for a Distributor and Test Transactional Subscriber for a total of 4 servers (3 virtual, 1 physical). Once you have the Windows OS (Server 2019) configured, you'll want to:
1. Run Sysprep on both intended Primary servers to make sure that networking components are all working - our sysadmin sometimes copy/pastes server images, if you do this then the Sysprep command is needed for networking to configure properly.
2. Use the windows Server Manager to install failover clustering (lots of
blog posts cover the frontends that show how to do this)
3. Setup a total of three network interfaces. We're probably horribly abusing this system by just using VPNs, but with the project constraints it was far simpler to have two VPNs than to try something more complex (our attempts with VLANs never quite worked properly). We use OpenVPN (two distinct VPNs).
4. In theory you can now go ahead and use the Failover Clustering UIs to setup failover clustering and it just works (it seems so easy on all the other blog posts...). In reality, there are still a few error messages you're going to hit unless you make very specific targeted changes. One error that had me stuck for a week was this:
Error while running test Validate IP Configuration - System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.ApplicationException: There was an error verifying the IP configuration. ---> System.ArgumentException: The parameter must be non-null and non-empty.
In order to fix that error, the step that really helped was to go into the Windows Event Viewer and find the FailoverClustering-Manager Diagnostics Log:
(Event Viewer (Local) -> Applications and Services Logs > Microsoft > Windows > Failover-Clustering Manager > right-click the Diagnostic option and choose "Enable Log")
Enable the log on both primary servers. Enabling that Event Viewer Log highlights exactly which parameter must be non-null and non-empty - it's an obscure and undocumented TcpIp registry setting that is apparently ABSOLUTELY VITAL to the whole failover cluster process:
Super-important Registry Setting that has to be configured on both Primary servers:
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
“SearchList”=”mydomain.org”
(or some other DNS)
5. While you're making registry changes, there's another really-important registry setting needed for workgroup configuration: localaccounttokenfilterpolicy (described
here):
New-ItemProperty -Path HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System -Name LocalAccountTokenFilterPolicy -Value 1
6. Next, you'll need to go into the detailed network settings and make some absurdly-specific changes. These are the settings that worked for our 2-person team, if you have a 20-person team then there's probably a better way to do this. What worked for us:
DNS settings for both servers - under advanced adapter settings for IPv4, make sure that the ethernet public info is static, register in DNS is disabled and “mydomain.org” is in the search list (the earlier registry setting should have done this). Disable IPv6 everywhere.
These menus are buried in windows settings - you'll need to do:
Control Panel > Network and Internet > Network and Sharing Center > Change adapter Settings > for all "plugged in" network connections (OpenVPN creates a few that aren't connected) > Right click, choose Properties, then:-uncheck "Internet Protocol Version 6 (TCP/IPv6)"
-select click "Internet Protocol Version 4 (TCP/IPv4)", click Properties
in that menu:
-make sure you have a static IP address configured and fixed DNS - "Use the following IP Address" and "Use the following DNS server addresses" (this is needed for your main public adapter, but may not be necessary for VPN connections), then press Advanced
in the advanced menu:
-verify the IP address/default gateway (shouldn't be any changes needed)
-press "DNS", then in that menu:
-verify the DNS IPs (shouldn't be any changes needed)
-verify that your domain shows up under "Append these DNS suffixes (in order)" - this should match the registry setting from earlier
-uncheck "Register the Connections' Address in DNS"
(note - this is a really important step for Failover IP assignment, there are probably other ways to do it but I'll walk through our solution later)
-Press "OK" on everything, then repeat this same setup for all the active connections (you should have at least 3). I also disabled IPv6 on all unplugged connections, which may or may not be necessary.
Windows Networking Menus - amazingly intuitive:
Note that the servers you're configuring must have valid URLS with a public DNS registrar where the machine name + DNS suffix matches the URL (so for a machine with name myprimary01 and DNS suffix .mydomain.org, the real URL must be registered at myprimary01.mydomain.org).
So, once you've made this set of changes to all the active network adapters (on both primary servers), we can finally move on to the failover cluster configuration.
No comments:
Post a Comment