Following the other 4 posts, you should now have a working alwayson cluster with two primary databases and a distributor. Up to now, we've been mostly following the Windows recommendations; these last few steps deviate from the Windows guides in order to get proper networking. While there are other ways to do this, the simplest that our team found was:
1. Setup OpenVPN on all the servers in the cluster
2. Use DNS-MASQ to expose an internal API that each server can use to alter the VPN DNS
3. Use custom automation scripts to alter DNS when a failover happens on one of the primary servers.
Custom DNS Scripting
First, we're going to need a script (more accurately, two scripts) that can set the DNS when a failover happens. For testing and simplicity, I found this worked best with two powershell scripts (one on primary01, one on primary02). I also found that the failed-over database permissions seemed a bit unreliable right after the failover; some accounts were machine-specific so when the DB failed over, the new accounts would need to be recreated - those are included in these powershell scripts. While this script could run more often, it's more efficient to use a helper Database that tracks the current failover state/active primary server so the script only runs as necessary. This helper database has a separate copy on each failover instance.
First, create a helper database (here AlwaysOnHelper), then create the table:
CREATE TABLE [dbo].[MostRecentAlwaysOnRole](
[AlwaysOnGroup] [nchar](60) NULL,
[MostRecentRole] [nchar](60) NULL
) ON [PRIMARY]
GO
Next, create the stored procedure to track whether the database has just failed over or not:
USE [AlwaysOnHelper]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- =============================================
-- Author: <Racette, Louis>
-- Create date: <2020-06-22>
-- Description: <Check if this server just failed over, if so modify permissions (basic edge detector).>
-- =============================================
CREATE PROCEDURE [dbo].[sp_CheckIfJustFailedOver] @AlwaysOnGroupName VARCHAR(MAX)
AS
BEGIN
DECLARE @ServerName NVARCHAR(256) = @@SERVERNAME
DECLARE @CurrentRole NVARCHAR(60)
DECLARE @AGName NVARCHAR(256) = @AlwaysOnGroupName
DECLARE @MostRecentRole NVARCHAR(60)
DECLARE @JustBecamePrimary NVARCHAR(60)
SELECT @CurrentRole = rs.role_desc
FROM sys.dm_hadr_availability_replica_states rs
JOIN sys.availability_replicas r
ON rs.replica_id=r.replica_id
JOIN sys.availability_groups ag
ON rs.group_id=ag.group_id
WHERE r.replica_server_name = @ServerName
AND ag.name = @AGName
SELECT @MostRecentRole = MostRecentRole from AlwaysOnHelper.dbo.MostRecentAlwaysOnRole where AlwaysOnGroup = @AGName
if (@MostRecentRole IS NULL)
BEGIN
insert into MostRecentAlwaysOnRole values(@AGName,'SECONDARY')
END
IF (@MostRecentRole != @CurrentRole and @CurrentRole = 'PRIMARY')
BEGIN
select @JustBecamePrimary = 'True'
END
ELSE
BEGIN
select @JustBecamePrimary = 'False'
END
update AlwaysOnHelper.dbo.MostRecentAlwaysOnRole set MostRecentRole = @CurrentRole where AlwaysOnGroup = @AGName
select @JustBecamePrimary as JustBecamePrimary
RETURN
END
Once you have that stored procedure, you'll want to repeat these steps on each AlwaysOn primary server. Then, create a powershell script on each primary server like this one:
Powershell script for primary01:
$currentPrimaryDBStatus = Invoke-Sqlcmd -Query @"
exec AlwaysOnHelper.dbo.sp_CheckIfJustFailedOver 'MYAVAILGROUP'
"@
$headers = @{
'X-Auth-Token' = 'MY_DNS_MASQ_API_KEY'
}
echo $currentPrimaryStatus
if ($currentPrimaryStatus.JustBecamePrimary -eq 'True') {
echo 'primary01 just became the main server, altering user permissions'
Invoke-Sqlcmd -Query @"
use MYDATABASE
DROP USER [Pub_Admin]
CREATE USER [Pub_Admin] FOR LOGIN [Pub_Admin]
ALTER USER [Pub_Admin] WITH DEFAULT_SCHEMA=[dbo]
ALTER ROLE [db_owner] ADD MEMBER [Pub_Admin]
DROP USER [Replication_Merge]
CREATE USER [Replication_Merge] FOR LOGIN [Replication_Merge]
ALTER USER [Replication_Merge] WITH DEFAULT_SCHEMA=[dbo]
ALTER ROLE [db_owner] ADD MEMBER [Replication_Merge]
use MyDatabase
EXEC sp_changedbowner 'Your_SQL_Admin'
"@
if ($currentPrimaryDBStatus.JustBecamePrimary -eq 'True') {
echo 'primary01 is the main server, altering DNS'
Invoke-RestMethod -Headers $headers -Method DELETE -Uri http://192.168.128.1/dnsmasq-rest-api/zones/myzone/(PRIMARY01-IP-ADDRESS)/myprimary.vpn
Invoke-RestMethod -Headers $headers -Method DELETE -Uri http://192.168.128.1/dnsmasq-rest-api/zones/myzone/(PRIMARY02-IP-ADDRESS)/myprimary.vpn
Invoke-RestMethod -Headers $headers -Method POST -Uri http://192.168.128.1/dnsmasq-rest-api/zones/myzone/(PRIMARY01-IP-ADDRESS)/myprimary.vpn
Invoke-RestMethod -Headers $headers -Method POST -Uri http://192.168.128.1/dnsmasq-rest-api/reload
}
Note that PRIMARY01-IP-ADDRESS and PRIMARY02-IP-ADDRESS would be the public IP addresses of these servers. They could also be setup with VPN IPs, but this introduces unnecessary performance bottlenecks when compared to the public IP.
The complementary script for PRIMARY02 does essentially the same thing, but the POST request that creates "myprimary.vpn" will create it for PRIMARY02-IP-ADDRESS. This means that the URL "myprimary.vpn" will always resolve to the active primary server during failovers; DNS-Masq is pretty fast and the DNS resolution time for a small VPN is tiny, so we've typically seen less than 10 second failover response times with this setup.
You'll notice that the Invoke-SQLCmd script doesn't have a password - that's because it will be called from within a different SQL job. That job (defined on MyPrimary01 here, but there would also be a copy at MyPrimary02) looks like:
BEGIN TRANSACTION
DECLARE @ReturnCode INT
SELECT @ReturnCode = 0
IF NOT EXISTS (SELECT name FROM msdb.dbo.syscategories WHERE name=N'[Uncategorized (Local)]' AND category_class=1)
BEGIN
EXEC @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=N'[Uncategorized (Local)]'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
END
DECLARE @jobId BINARY(16)
EXEC @ReturnCode = msdb.dbo.sp_add_job @job_name=N'FailoverMyPrimary01',
@enabled=1,
@notify_level_eventlog=0,
@notify_level_email=0,
@notify_level_netsend=0,
@notify_level_page=0,
@delete_level=0,
@description=N'No description available.',
@category_name=N'[Uncategorized (Local)]',
@owner_login_name=N'MYPRIMARY01\MYADMINUSER', @job_id = @jobId OUTPUT
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
/****** Object: Step [FailoverPrimary01] Script Date: 9/9/2020 3:52:22 PM ******/
EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'FailoverPrimary01',
@step_id=1,
@cmdexec_success_code=0,
@on_success_action=1,
@on_success_step_id=0,
@on_fail_action=2,
@on_fail_step_id=0,
@retry_attempts=0,
@retry_interval=0,
@os_run_priority=0, @subsystem=N'CmdExec',
@command=N'powershell.exe C:\FailoverScripts\FailoverPRIMARY01.ps1',
@flags=0
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @jobId, @server_name = N'(local)'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
COMMIT TRANSACTION
GOTO EndSave
QuitWithRollback:
IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION
EndSave:
GO
There's quite a bit of boilerplate there, but the key line is the @command where the powershell script actually gets invoked. There's one final piece needed to run this job - an alert on AlwaysOn Failover activity:
USE [msdb]
GO
EXEC msdb.dbo.sp_add_alert @name=N'FailoverDNS',
@message_id=1480,
@severity=0,
@enabled=1,
@delay_between_responses=0,
@include_event_description_in=0,
@category_name=N'[Uncategorized]',
@job_id=N'e9292086-7fc7-47ab-96cf-4476beafc48f'
GO
It's a bit hard to tell from the @job_id field directly, but if you use the "Properties" feature to view this job you'll see that it's set to trigger on alert 1480 (an internal SQL alert for when a failover occurs) and run the job "FailoverMyPrimary01".
As a final checklist, you would expect that both primary01 and primary02 should have:
1. Identical "FailoverDNS" alerts for message_id 1480
2. Nearly-Identical "FailoverMyPrimary" jobs, the only difference would be the powershell script that gets called
3. Identical "AlwaysOnHelper" databases with the table and stored procedure described above
4. Nearly identical Powershell scripts for setting DB user permissions and altering DNS entries.
Now that we have all those layers set up, the behavior that we'll get is:
(In the event that Primary01 is currently active, Primary01 IP is 8.8.8.5 and Primary02 IP is 8.8.8.4)
0. Replication works normally to Primary01 IP (myprimary.vpn points to 8.8.8.5)
1. Primary01 either goes offline (hardware failure, shutdown, SQL Service restart, etc.) or has a planned failover
2. The FailoverDNS alert runs on primary01 (if it's still possible to do) and primary02
3. Both Primary01 and Primary02 check to see if they just became the primary server; primary02 would see that it just went from SECONDARY to PRIMARY
4. Primary02 sends a POST request to the VPN server that alters the value of myprimary.vpn to 8.8.8.6.
5. Replication continues to work as normal (subscribers don't notice that the IP address for the URL myprimary.vpn has changed), possibly with interruption for any transactions that were running at the time
I wouldn't call it elegant, but this set of custom scripts and DNS-MASQ/OpenVPN tooling worked better for us than the built-in windows tools (without requiring Active Directory).
Putting it all Together/Testing
In order to test the full power of this approach, you'll need at least four servers:
Two primary servers
A Distributor
A combination transactional subscriber and Sync Gateway
Once you've configured these as appropriate, you should be able to failover between primary servers without disrupting replication for more than a few seconds.
Of course, this almost never works on the first try. Some common problems:
Problem: I created an availability group, but one or more replicas isn't showing as "joined" to the group.
Solution: There are a wide variety of things that can do this; the best way to get detailed information is to right-click the Availability group name, bring up the "Always On Dashboard" and navigate to "View Always On Health Events". Generally the debugging steps I would recommend following for this are:
1. Check that you can ping between both primary servers
2. Check the SQL aliases (32 and 64 bit) for both primary servers
3. Check that you can open an SSMS connection across the network to both servers (from myprimary01 to myprimary02 and vice-versa)
4. Make sure that you can login to SQL using all the relevant accounts (especially the System account). I've noticed that mixed windows-SQL accounts don't migrate well; changing the password of a system account or recreating the windows account on a different machine may require dropping + recreating the SQL account.
Problem: When I try to run the replication snapshot job, I see this error message
Validation failed for the publisher 'MYLISTENER' with error 21879 severity 16 message 'Unable to query the redirected server 'MYLISTENER' for original publisher 'MYPRIMARY01' and publisher database 'MYDB' to determine the name of the remote server; Error 18452, Error message 'Error 18452, Level 14, State 1, Message: Login failed. The login is from an untrusted domain and cannot be used with Windows authentication.'.'.
Solution: This error message indicates that a windows login won't work with the DNS redirection; the best way to debug is to try to determine which account this means. A useful debugging tool is to navigate to the COM directory of SQL Server (usually C:\Program Files\Microsoft SQL Server\130\COM but the number varies by version) and try running the replication tools manually with a CMD line. Trying different parameters for user logins (such as this guide to the snapshot tool) can help to isolate the issue; in general a few things that can help are:
1. Try the Registry setting for TCP Search list (Described in part 1) on the distributor and restart
2. Make sure the windows accounts have the right passwords (remember that duplicating server images can invalidate windows accounts in SQL Server, I've seen some weird behavior here - it can be safest to delete and recreate all windows accounts in SQL Server). Windows accounts also need read/write access to system directories in order for replication to work properly - see the replication security model for more detail.
3. Check that the SQL Server Service account matches between the distributor server and primary servers (same name/password). This can be relevant if you are using the replication tools without proxy settings.
4. Depending on your desired security settings, you may need to enable/disable the SQL Replication job proxy. On the distributor, go to SQL Server Agent > Jobs > right click your replication snapshot job (usually "MyPrimary01-MyDB-MyReplicationName-#" or something similar) then edit the job properties, go to "steps" > edit the "Run Agent" step (step 2) > alter the "Run as" setting. If this proxy isn't working, one way to debug/try different settings is to directly modify the snapshot setting string (starts with "-Publisher" in the screenshot below). You'll have to harcode passwords in plaintext to do so, which is not a good security practice - don't do that on production. In the past I've used this string to bypass windows logins entirely, solving the above error (but again, not something you want to do in production).
5. I've also had some past success with this error message by using the "sp_changeowner" command on MyDB (make sure this is a known admin account).
Problem: I setup a sync gateway, but can't understand the error messages it's giving.
Solution: Debugging merge replication syncing is quite difficult; the error messages are often quite precise but very jargon-heavy. Some common error messages:
Notes:
Physical vs. Virtual Servers
I had originally envisioned this to make use of one physical server (for performance) with a failover to a virtual machine (for reliability). While this configuration is possible, test failovers to the VM showed some pretty serious performance impacts; I'd recommend using two physical servers for this failover, otherwise the failover state is likely to cause unexpected performance issues.
Log Shipping and Transaction Log Growth
A few days after putting this live on production, I found that the "Full" backup option was causing my transaction log to grow unsustainably. Setting up log shipping to take regular backups of the transaction logs seems to have fixed the issue; there are probably alternative solutions but I found that the easiest.
No comments:
Post a Comment