Wednesday, October 27, 2021

Cowboys vs. Soldiers

I classify developers into two main camps: Cowboys (aka Hackers) and Soldiers (aka Tweakers) - cowboys are great for startup environments where you need innovative, duct-tape solutions as quickly as possible. Soldiers are necessary for long-term scaling - they're interested in doing things the "Right Way", coming up with processes and completing tasks as assigned. I'm more in the first category (I tend to do a lot of duct tape fixes), but I know a number of developers in the second category - it's pretty amazing that when I ask them to do something and they just... do it. They do high quality work, they ask questions when appropriate, they learns quickly but they're not trying to introduce projects/drive their own agenda.

(Note: in casual discussion I've had some people point out that you can hybridize the two approaches - this is a spectrum, not a binary)

(Note 2: yes, I realize both of those jobs are male dominated and this reinforces problematic stereotypes about the tech world. Mentally replace "Cowgirl" or "Nonbinary ranch hand" wherever you see fit; I'm talking about archetypes, not individual people.)

The stereotypical Silicon Valley leadership model of tech says "Only Hire Rockstars and Cowboys", but I strongly disagree. Working with Soldiers is great. Unfortunately, the road to promotion and power often expects "Cowboy" skills - being visible in high-profile projects, driving your own agenda and pushing through changes. In an ideal world, we could just reward "leadership by example"; in many organizations this isn't really possible. Especially in a large organization, I can't honestly recommend for Soldiers to start acting like Cowboys - Cowboys often butt heads with each other (we all think we're the smartest person in the room. As the smartest person in the room, this is really annoying to me.)

Instead, I would encourage Soldiers to start thinking about different ways to drive change - not necessarily "screw you guys, I'm using my OWN programming language" (cowboy style) but a more diplomatic, soft-skill driven way of implementing changes within an organization. Leverage the fact that you're easy to work with as a way to start collaborating across a larger team. I don't know of any magic formula to do this, but emotional intelligence helps. Also, free booze (or free herbal tea for the non-drinking coworker).

Wednesday, September 9, 2020

Technology: Combining AlwaysOn Clusters with Merge/Transactional Replication outside of Active Directory (Part 5)

Following the other 4 posts, you should now have a working alwayson cluster with two primary databases and a distributor. Up to now, we've been mostly following the Windows recommendations; these last few steps deviate from the Windows guides in order to get proper networking. While there are other ways to do this, the simplest that our team found was:

1. Setup OpenVPN on all the servers in the cluster

2. Use DNS-MASQ to expose an internal API that each server can use to alter the VPN DNS

3. Use custom automation scripts to alter DNS when a failover happens on one of the primary servers.

Custom DNS Scripting

First, we're going to need a script (more accurately, two scripts) that can set the DNS when a failover happens. For testing and simplicity, I found this worked best with two powershell scripts (one on primary01, one on primary02). I also found that the failed-over database permissions seemed a bit unreliable right after the failover; some accounts were machine-specific so when the DB failed over, the new accounts would need to be recreated - those are included in these powershell scripts. While this script could run more often, it's more efficient to use a helper Database that tracks the current failover state/active primary server so the script only runs as necessary. This helper database has a separate copy on each failover instance.

First, create a helper database (here AlwaysOnHelper), then create the table:

Technology: Setting up a Merge Replication Gateway Server

Note: I didn't write this document, credit goes to Lucy Yao - but it's very useful information to have and is frequently helpful during server setup.

New Server Setup

Install SQL Server x64 (Standard – Server with a GUI)
Create Windows accounts

My_Replication_Merge
My_Replication_Dist
My_Replication_Log
My_Replication_Snapshot
My_SQL_Agent (Agent Service)
My_SQL_DBEngine (Server Service)
My_Admin account

Rename computer to SYNCGATEWAY
Install .NET 3.5

In Server Manager, click Manage (upper right), then click Add Roles and Features
Push Next until you reach the features page. Select .Net Framework 3.5 and click Next
On the confirmation page, click Specify Alternate Source Path. Type D:\sources\sxs at the bottom (if the ISO from 1a was mounted in drive D).
Complete the installation

Install IIS

In Server Manager, click Manage (upper right), then click Add Roles and Features
Click Next until you reach Server Roles, then select IIS. Click Add Features at the pop-up (if applicable).
At the Features page, expand .NET Framework 4.5 and select ASP.NET 4.5
Click Next until you reach Role Services. Select the following

Basic Authentication
Client Certificate Mapping Authentication
IIS Management Console
IIS 6 Management Compatibility

Install
In Server Manager, click Manage (upper right), then click Add Roles and Features
Click Next until you reach Server Roles
Select the following in the Application Development submenu

ISAPI Extensions
ASP.NET (3.5 and 4.5)
.NET Extendibility (3.5 and 4.5)

Install

Install SQL Server 2014

In Feature selection, choose

Database Engine Services
SQL Server Replication
Client Tools Connectivity
Management Tools –Basic
Management Tools –Complete

In Server Configuration, set the account name to

SSE_Agent for SQL Server Agent and set Startup Type to Automatic
SSE_DBEngine for SQL Server Database Engine and set Startup Type to Automatic

In Database Engine Configuration, click Add Current User
Install

Enable Mixed Mode Authentication (In SSMS right click SYNCGATEWAY, then properties. Go to the Security tab)
Add relevant user permissions in SQL (a custom SQL script helps here)
Give all users full permissions to the C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\repldata
Restore the database (MYDATABASE)
Add users to distribution database

Select Security then My_Replication_Merge on the left-hand pane. Right-click and select properties
In User Mapping, select distribution. In Default Schema, type in dbo
In the check box below, select db_owner
Repeat the process for My_Replication_Snapshot and My_Replication_Dist

IIS Setup – Replication Website

Certificate

Open IIS and click SYNCGATEWAY. Click Server Certificates
On the right-hand pane, select Create Self-Signed Certificate
Pick a name and follow the creation process

Replication Site

On the left-hand pane, under Sites, right-click Default Website and rename it Replication
With the Replication Website selected, click Bindings
Add a binding with Type: https. For SSL certificate, select the one created in 12a (Alternatively, use a non-self signed SSL certificate)
Remove the default http port 80 site

Installing the SQL Server Replication Listener

Create a new file directory for replisapi.dll on the computer that is running IIS. Create the directory under the C:\Inetpub\SQLReplication directory
Copy replisapi.dll from the directory C:\Program Files\Microsoft SQL Server\120\com\ to the file directory that you created in step 1.
Click Start, and then click Run. In the Open box, enter cmd, and then click OK.
In the directory created in step 1, execute the following command: regsvr32 replisapi.dll
In Internet Information Services (IIS) Manager, in the Connections pane, right-click Replication, and then select Add Virtual Directory.
For Alias, enter SQLReplication.
For Physical Path, enter C:\Inetpub\SQLReplication\, and then click OK.
In Internet Information Services (IIS) Manager, click Replication
In the center pane, click Handler Mappings.
In the Actions pane, click Add Module Mapping.
For Request Path, enter replisapi.dll.
From the Module drop-down list, select IsapiModule.
For Executable, select C:\Inetpub\SQLReplication\replisapi.dll
For Name, enter Replisapi.
Click the Request Restrictions button, click the Access tab, and then click Execute.
Click OK to close the Request Restrictions dialog box, and then click OK again to close the Add Module Mapping dialog box. When you are prompted to allow the ISAPI extension, click Yes to add the extension.
Verify that Replisapi.dll is listed under the Enabled handler mappings. If it is in the Disabled list, right-click the Replisapi entry and then click Edit Feature Permissions. Check the Execute box, and then click OK.

To Configure IIS Authentication

In Internet Information Services (IIS) Manager, click Replication.
In the middle pane, double-click Authentication.
Right-click Anonymous Authentication, and then choose Disable.
Right-click Basic Authentication, and then choose Enable.

To require SSL security for a Web site

In the middle pane, double-click SSL Settings.
Check the Require SSL option. Under Client certificates, verify that the Ignore button is selected.
Click Apply

To test the certificate

From the Actions pane, click Browse *:443(https).
Internet Explorer will open and display a message that "There is a problem with this website's security certificate." This warning tells you that the associated certificate was not issued by a recognized CA and might not be trustworthy. This is an expected warning, so click Continue to this website (not recommended).
If you are prompted to Connect to localhost, enter a user name and password to proceed. You should see the default page for the Web site. (The Windows administrator username and password).

Add My_Replication_Merge to the IIS_IUSRS group

In Server Manager, click Tools then expand Computer Management, expand Local Users and Groups, and then click Groups.
Right-click IIS_IUSRS, and then click Add to Group.
In the IIS_IUSRS Properties dialog box, click Add.
In the Select Users, Computers, or Groups dialog box, add SYNCGATEWAY/My_Replication_Merge.
In the Select Users dialog box and the IIS_IUSRS Properties dialog box, click OK.

Grant minimum account permissions on the folder that contains replisapi.dll

In Windows Explorer, right-click the folder that you created for replisapi.dll, and then click Properties.
On the Security tab, click Edit.
In the Permissions for <foldername> dialog box, add SYNCGATEWAY\My_Replication_Merge
Verify that the account is granted only Read, Read & Execute, and List Folder Contents permissions.

Create an application pool in Internet Information Services (IIS) Manager

In Internet Information Services (IIS) Manager, in the Connections pane, expand the local server node.
Right-click Application Pools, and then click Add Application Pool.
Enter Replications as the name for the application pool, leave the default values for the remaining fields, and then click OK.

Associate the account with the application pool

In Internet Information Services (IIS) Manager, expand the local server node, and then click on Application Pools.
Right-click the application pool that you created, and then click Set Application Pool Defaults.
In the Application Pool Defaults dialog box, scroll down to the Process Model section, and then click the Identity field.
Click the ellipsis button on the right side of the Identity row.
Click the Custom Account radio button, and then click Set.
In the User name and Password fields,enter My_Replication_Merge for the user name and the windows password as the password.
Click OK to close the Application Pool Identity dialog box, and then click OK again to close the Application Pool Defaults dialog box.

Associate the application pool with the replication Web site

In Internet Information Services (IIS) Manager, expand the local server node, and then click on the Replication website
In the Actions pane, under Manage Web Site, click Advanced Settings.
In the Advanced Settings dialog box, click on the ellipsis button to the right of Application Pool.
From the Application pool drop-down list, select Replications
Click OK again to close Advanced Settings.

To test the connection to replisapi.dll

In Microsoft Internet Explorer, on the Tools menu, click Internet Options.
On the Connections tab, click LAN Settings.
If a proxy server is not used on the LAN, clear Automatically Detect Settings and Use a proxy server for your LAN.
If a proxy server is used, click Use a proxy server for your LAN and Bypass proxy server for local addresses, and then click OK.
Go to https://(your URL)/SQLReplication/replisapi.dll?diag
Log in with the Administrator account
The page should look like this

You now have a working sync gateway! A separate post will cover how to sync to it from the merge subscriber.

Wednesday, September 2, 2020

Technology: Combining AlwaysOn Clusters with Merge/Transactional Replication outside of Active Directory (Part 4)

Set up custom SQL networking

In order for SQL to work, different servers need to be able to talk to each other; if you use nonstandard ports, the easiest way to manage this is to use SQL Alias configured in SQL Configuration Manager. The alias connections that will be needed are:

MyPRIMARY01 + MyPRIMARY02: will need aliases to the distributor and any transactional subscribers

MyDISTRIBUTOR: will need alias to the Primary Listener and any transactional subscribers

Transactional Subscriber(s): will need alias to the Primary Listener and Distributor

For safety, I usually make these connections as both 32-bit and 64-bit aliases.

Now for the workaround that makes the rest of this possible - after struggling with windows-internal DNS tools, we eventually gave up and made our own DNS solution. A combination of OpenVPN and DNS-MASQ makes it possible to have a dynamically-controlled DNS address that's handled through automation. This will show up as a DNS entry that's not publicly available (shown here as myprimary.vpn), controlled by an automation script. This custom dynamic DNS in combination with replication made it possible to put all the pieces together so that we could have a failover cluster that worked with both merge and transactional replication. One of the complexities involved is that SQL can only handle NETBIOS server names (not URLs), so we used aliases to hide this complexity from SQL SERVER:

Then, for replication to work you'll need to setup server linkages between the DISTRIBUTOR server and both PRIMARY servers.

Run this distributor + linkage creation script on the DISTRIBUTOR server:

Technology: Combining AlwaysOn Clusters with Merge/Transactional Replication outside of Active Directory (Part 3)

Ok, SQL configuration is relatively straightforward if you've already done the network/adapter/cluster configuration correctly. If you haven't, this involves a lot of debugging at different layers of the stack.

First off, it's REALLY important that you have identical SQL versions in use by both machines in the failover cluster ('select @@version' helps here). Make sure that you've installed identical versions of SQL Server. Also, make sure that the Windows account used to run the SQL Server service has identical credentials on both servers and is part of the Computer's administrator group (Services > 'SQL Server (MSSQLSERVER)' > 'Log on' > double check the account credentials are the same for both).

Once the SQL accounts/versions align, you'll need to configure the SQL Server instance to allow use within a failover cluster (on both servers). This is in the SQL Configuration Manager Tool > SQL Server Services > right-click the SQL Server (MSSQLSERVER) choice > go to "AlwaysOn High Availability" and check "Enable AlwaysOn Availability Groups", then "OK".

Do this on both servers, then restart the SQL Services on both. This will allow SQL Server to use the AlwaysOn components.

One step that you may or may not need (depending on the code you plan to run) is the "Distributed Transaction Coordinator" configuration. We make use of this for a script that creates indexed views in replication (I'll write this up separately, it's a custom script very specific to what we do at OAF), so for our configuration I'll need to enable this windows feature using the steps here.

Next, open up SSMS and log in. It makes the process easier if there's a shared Admin SQL account that exists on all 4 servers (MyPrimary01, MyPrimary02, MyDistributor and the Transactional Subscriber). Restore a backup of your target (or example) database - this will probably be replaced later, so for now I'd recommend using a minimized/seed backup. Once you have a restored backup, use the "New Availability Group Wizard" to setup the Availability group. Some notes:

Note that many of these SQL objects/failover cluster objects end up with virtual IPs or virtual URLs, which can cause conflicts. To prevent any possible naming conflicts (especially in cases where the failover cluster is using Virtual IPs/Virtual machine names) it's safest to make sure that none of these names are the same:

Availability Group Name
Server Names (for all servers in the cluster)
Failover Cluster Name
Listener Name

Full Recovery Mode is Required for whichever DB you are using - see notes in part 5 on customizations, this is necessary but can cause the transaction log to grow unsustainably until you set up log shipping
Under "Specify Replicas", with a standard SQL license you can have only two machines listed. Typically to get full use of the AlwaysOn feature you'll want Synchronous Commits with Automatic Failovers (there are some performance tuning/advanced features possible here). Skip the listener for this first setup

Ignore any errors about Windows Domain Logins - as long as the Windows account used to run the SQL service matches on both servers this should be fine.
Use Automatic Seeding (as long as the disk partition labels/sizes are the same on both machines this should be fine)
The next steps of the wizard are fairly self-explanatory - if everything works, you should see the cluster created.

So the availability group now exists, but can't really do much that's useful yet. A few different pieces are needed first - starting with a listener.

Listeners took me some time to figure out, a lot of the AlwaysOn documentation treats them like a load balancer (other server that points to the Primary node of the cluster), but they act more like a Status that is "owned" by the Primary node. This is where subnet masks come into play - you'll need to use a config script of them form:

Technology: Combining AlwaysOn Clusters with Merge/Transactional Replication outside of Active Directory (Part 2)

See Part 1 for details on setting up networking. After that, we still have a ways to go when setting up the servers:

Configure the Failover Cluster

After those registry and networking changes, you should be able to open up the Windows UI and try setting up a failover cluster (starting at step 26 from this blog post). If you are amazingly lucky, then all the tests will pass and things will work to create the cluster; if not, then you'll want to double-check the registry and networking settings on both machines. A good sanity check is to make sure that you can ping between the two Primary servers. Note that it doesn't matter which machine is used to configure the cluster, the idea is that after configuration both will be equal partners in the new cluster.

If you make a mistake with partial cluster configuration, then the Powershell Remove-Cluster and Remove-ClusterNode commands are really helpful to clear out all traces of the cluster and try again.

Really important note - this screen caused a lot of problems for me during the cluster configuration:

a lot of Active Directory internals expect that you'll be able to create Virtual IP addresses. This screen only allows you to use administration IPs only in cases where the subnet mask has some extra "IP space" - if the subnet mask for a network connection is 255.255.255.255 then you can't use it to administer the cluster. This can lead to some undesired security situations on public IPs (you don't know who owns the IPs close to yours and there's a risk of sending traffic to them), but on VPNs is ok to open up the subnet mask to 255.255.255.252 or larger.

In general, you'll need to give the failover cluster some imaginary virtual IPs from your VPN (a fake IP address within the subnet mask range) in order to get a SQL Listener working (described in part 3). There's almost certainly a better way to do this, I usually just do a combination of guessing, tweaking the subnet masks and trying out subnet mask calculators until I can find a subnet in the right range.

The important rule is:

Real IP Address: Use a real Subnet mask (255.255.255.255), anything else is a security risk.

VPN IP Address: Use a loose Subnet mask (255.255.255.252 or larger), otherwise the listener won't work later on.

Generally the tests that fail at this stage are related to "IP Configuration" - if those pass (but things like the firewall fail) then you're usually in good shape to create the cluster.

As for the cluster name - note that the Active Directory holdovers of this failover clustering expect every created "virtual object" to be distinct. This means that all of these must have different names:

MyPrimary01
MyPrimary02
MyDistributor
MyFailoverCluster
MySQLAvailabilityGroup
MySQLListener

Once you've created the cluster (the name must be unique/not overlap with a real server name), you should be able to see a few things - under "Nodes", you'll expect both machines to show up:

then, you'll want to check "Networks" and verify that ALL of your VPN adapter IPs are showing up here. If you're missing any one of the adapter IPs, weird things will happen - we'd expect 4 VPN IPs (+1 in my case for the public IP used to administer the cluster).

Note the "Cluster and Client" setting - generally this should be restricted to only allow clients on certain networks, with a VPN it's typically fine to allow "Cluster and Client" communication on all networks. This will be needed for the SQL Listener later.

So there now should be a failover cluster with accurate networks and nodes - on to the SQL configuration. See Part 3 for details on how to setup the SQL Configuration for the Failover Cluster.

Technology: Combining AlwaysOn Clusters with Merge/Transactional Replication outside of Active Directory (Part 1)

Translated into slightly-less-technical terms:

Our database already runs on over 150 separate computers with mechanisms to keep the data on all 150 synchronized and consistent. How do you preserve this multi-computer synchronization while also adding in the ability for the entire system to self-repair in the event of an emergency (like a hardware failure)?

Buckle up, this is going to be a long article - it took me a solid month to figure out the right process for making this work, hopefully it saves someone else effort in the future. Before I jump into the full setup details, some background on what I'm trying to do and why. I'm going to assume that if you're reading this, you have some fluency in SQL Server Management Concepts; if not, this is going to be a dense maze of technical jargon. This article describes an advanced technique, not something I'd recommend for SQL Server beginners.

Glossary of Server Names in our Stack:

Primary server: "Source of Truth" used for writing data in the cloud. As of the time I write this, this server has between 1 and 10 active financial transactions per second (approximately $100 Million per year in incoming Mobile Money traffic).
Replication: Mechanism that copies data from the Primary server to other places
Distributor server: Does the main work of Replication (copying data around, networking, etc.)
Merge Replication Laptop: Mostly-offline tiny slice of our database (see the technical intro for more detail)
Transactional Replication Server: Online Read-Only copy of our database
Gateway Server: Provides a public gateway that lets merge-replicated laptops talk to our main server
AlwaysOn Failover Cluster: Keeps a backup "Source of Truth" on standby so that when (not if) the Primary server has a hardware problem or other unexpected outage, there's a backup ready to step in and take over.
Active Directory: Windows Toolset to handle a large number of servers/users across a domain. Normally it would be someone's full-time job to handle this.

Before June 2020, this is how our server setup looked (simplified):

In March 2020, we started seeing our physical servers dropping like flies (they were all configured at roughly the same time and had roughly similar specifications; one by one they started showing hardware failures). That $100 Million server had regular (24-hour backups) and log-shipping, but considering that every second of downtime costs the organization $3 something more robust was needed. Between March 2020 and June 2020, I tested out and implemented the process that got our production stack to look like this instead:

Normally this kind of high-value financial system upgrade would be done by a team of dedicated database engineers, consultants and risk managers. We don't have those resources, so this project was done by:

Two people (a single Sysadmin and a single DBA) - we decided early on that we didn't have the time to maintain active directory, so made use of workgroup failover clustering instead.
Non-enterprise SQL license (really stretching the limits of what we can do without paying for an upgrade)
Hetzner server host (like AWS/Azure but substantially cheaper and with less-advanced cloud tools)

This is a really complex process that frequently had me working weekends/cursing out obscure SQL Server/networking issues. I'll try to cover both the process that ended up working and some of the debugging steps/pitfalls I ran into along the way. So, without further ado, here's how you setup a SQL configuration with:

Distributor separate from Primary server
Merge Replication
Transactional Replication
AlwaysOn Failover Clustering in a Workgroup Configuration (No Active Directory)
Server 2019
SQL 2016

Step 1: Setup Windows Failover Clustering

We use hetzner server hosting for this - we used a paired physical server and virtual server cluster (see my notes at the end about the advantages/disadvantages of doing so) along with two virtual machines for a Distributor and Test Transactional Subscriber for a total of 4 servers (3 virtual, 1 physical). Once you have the Windows OS (Server 2019) configured, you'll want to:

1. Run Sysprep on both intended Primary servers to make sure that networking components are all working - our sysadmin sometimes copy/pastes server images, if you do this then the Sysprep command is needed for networking to configure properly.

2. Use the windows Server Manager to install failover clustering (lots of blog posts cover the frontends that show how to do this)

3. Setup a total of three network interfaces. We're probably horribly abusing this system by just using VPNs, but with the project constraints it was far simpler to have two VPNs than to try something more complex (our attempts with VLANs never quite worked properly). We use OpenVPN (two distinct VPNs).

4. In theory you can now go ahead and use the Failover Clustering UIs to setup failover clustering and it just works (it seems so easy on all the other blog posts...). In reality, there are still a few error messages you're going to hit unless you make very specific targeted changes. One error that had me stuck for a week was this:

Server Farmin'

Wednesday, October 27, 2021

Cowboys vs. Soldiers

Wednesday, September 9, 2020

Technology: Combining AlwaysOn Clusters with Merge/Transactional Replication outside of Active Directory (Part 5)

Custom DNS Scripting

Tuesday, September 8, 2020

Technology: Setting up a Merge Replication Gateway Server

New Server Setup

Wednesday, September 2, 2020

Technology: Combining AlwaysOn Clusters with Merge/Transactional Replication outside of Active Directory (Part 4)

Set up custom SQL networking

Technology: Combining AlwaysOn Clusters with Merge/Transactional Replication outside of Active Directory (Part 3)

Technology: Combining AlwaysOn Clusters with Merge/Transactional Replication outside of Active Directory (Part 2)

Configure the Failover Cluster

Technology: Combining AlwaysOn Clusters with Merge/Transactional Replication outside of Active Directory (Part 1)

Step 1: Setup Windows Failover Clustering

Cowboys vs. Soldiers

Report Abuse

Labels