DNS is a great technology that everyone on the internet relies on. How would you reach a given website if you weren’t able to resolve its name to an IP address? Would you memorize the public IP address of every website you want to reach? No, and as IPv6 adoption grows, DNS will become even more important for using the internet. But DNS has one drawback: its records are usually static. If a platform is dynamic and spawns or removes instances on the fly, it needs a way to modify the published DNS records, so that an unreachable instance is not even listed.
The use case: Veeam Cloud Connect
I started looking at these types of services while working on Veeam Cloud Connect. I wrote both the Reference Architecture for v8 and the new one for v9. One of the suggested ways to publish Veeam Cloud Connect has always been DNS Round Robin: it’s an easy way to publish multiple instances under the same DNS hostname, like in my lab:
gtw1.virtualtothecore.com 220.127.116.11
gtw2.virtualtothecore.com 18.104.22.168
gtw3.virtualtothecore.com 22.214.171.124
These three gateways can load balance between them without the need for any external load balancer, and they are all reachable through a common DNS hostname, cc.virtualtothecore.com, that is mapped to all three addresses. The client component only needs to talk to one of the gateways, and from there on the software is able to switch to any of the other gateways if needed. But what if I want to immediately remove a gateway that’s not reachable, for example during a maintenance period, without having to manually edit and update my DNS configuration? And, obviously, re-add the same record once the gateway is up again?
Here’s where Health Checks in AWS Route 53 come into play.
AWS Route 53 DNS health checks
All my domains and their DNS zones are already hosted in AWS Route 53. I think it’s a really compelling DNS service, and it makes editing DNS zones very easy. So, when I first built my Veeam Cloud Connect, I created my new records in the virtualtothecore.com hosted zone:
The default routing policy for the records is Simple: with multiple values in one record, this is what’s usually known as DNS Round Robin. Round-robin DNS works by responding to DNS requests not with a single IP address, but with a list of IP addresses corresponding to several servers that host identical services. The order in which the IP addresses from the list are returned is the basis for the term round robin: with each DNS response, the sequence of addresses in the list is permuted. Basic IP clients usually attempt a connection to the first address returned by a DNS query, so on different connection attempts clients are served by different servers, which distributes the overall load among them.
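The rotation described above can be sketched in a few lines of Python. This is only a simplified model: it assumes a strict rotate-by-one on every response, while real DNS servers and resolvers may shuffle or cache answers differently. The function name `round_robin_responses` is made up for this illustration.

```python
from collections import deque


def round_robin_responses(addresses, queries):
    """Yield the answer list for each query, rotating the order by one each time."""
    ring = deque(addresses)
    for _ in range(queries):
        yield list(ring)
        ring.rotate(-1)  # the next response starts with the next address


# The three lab gateways listed earlier in the post.
gateways = ["220.127.116.11", "18.104.22.168", "22.214.171.124"]

responses = list(round_robin_responses(gateways, 6))

# A basic client connects to the first address of each response,
# so six queries end up spread evenly over the three gateways.
first_choices = [answer[0] for answer in responses]
for answer in responses:
    print(answer)
```

Every response still contains all three addresses; only the order changes, and that is what spreads the clients around.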
The main limit of round-robin DNS is record caching in the DNS hierarchy itself, as well as client-side address caching and reuse. This is why my records have a TTL (Time To Live) of 60 seconds. But we can do even better using Health Checks. With Health Checks, Amazon Route 53 checks the health of the resources and responds to DNS queries using only the healthy ones. For my scenario, I’m choosing active-active failover: this configuration is meant for when you want all of your resources to be available the majority of the time. When a resource becomes unavailable, Amazon Route 53 can detect that it’s unhealthy and stop including it when responding to queries.
Health Checks configuration
In the Health Checks area of Route 53, we select “Create health check”, and we configure it like this:
With this setup, the Veeam Cloud Gateway will be tested every 30 seconds, and after 3 failures (that is, 90 seconds) it will be considered unavailable. You can also receive an alarm via e-mail when the check is triggered:
We repeat the same configuration for the other two gateways, and as soon as the checks are configured, we can immediately trace information like health and latency of the monitored services:
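To make the "3 failures" logic more concrete, here is a minimal Python sketch of what a TCP health check with a failure threshold does. It is not how Route 53 is implemented: `tcp_probe` and `HealthTracker` are names invented for this example, and I am assuming that the same threshold of consecutive results applies both to marking an endpoint unhealthy and to marking it healthy again.

```python
import socket


def tcp_probe(host, port, timeout=4.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class HealthTracker:
    """Flip the status only after `threshold` consecutive contradicting results."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.healthy = True
        self._streak = 0  # consecutive results that disagree with the current status

    def record(self, probe_ok):
        if probe_ok == self.healthy:
            self._streak = 0
        else:
            self._streak += 1
            if self._streak >= self.threshold:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy


# Simulated probe results, one every 30 seconds: the gateway goes down,
# then comes back. No network access is needed for this part.
simulated = [True, False, False, False, True, True, True]
tracker = HealthTracker(threshold=3)
statuses = [tracker.record(ok) for ok in simulated]
print(statuses)
```

Against a real gateway, each result would come from a call like `tcp_probe("gtw1.virtualtothecore.com", 6180)` made once per interval, instead of from the simulated list.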
Changing DNS configuration to use health checks and weighted resources
Now that we have the correct health checks in place, it’s time to modify our DNS configuration to use them. If you look again at the DNS configuration I captured in the first screenshot, a single A record is holding all three IP addresses. In order to switch to the new configuration, we need to split those IPs over three different records, and remove the previous one. The final result of each new record will be like this:
Let’s go through the selected options:
– the record is of type A, and the name of each one is cc.virtualtothecore.com. Even though we now have one record for each IP, the name and type are the same as before
– TTL is again very low, 60 seconds. This way the client-side cache never holds an answer for long, and when a failover occurs, the time needed to switch from one record to another is short
– Value is the IP used by each gateway, one for each new record set
– Routing policy: the previous policy was set to “Simple”, which is the default policy. But you cannot create multiple records with the same name using the simple policy. So, we switch to the Weighted policy: each record has a weight of 10, and the sum of them all is 30. Each record is returned to a client doing a DNS resolution with a frequency determined by its weight. Since each weight is 10, the share of each record is 10 / (10 × 3) ≈ 33%. So, as before, responses are spread evenly over the three records with no special preference for any.
– Set ID is only a description of the record. In this field I’ve just written the name of the cloud gateway associated with the public IP listed in the record
– Health Check: here I linked each record set with the corresponding Health Check.
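The combination of weights and health checks in the list above can be modelled with a short Python sketch. It is a simplified illustration, not Route 53's actual algorithm: the `records` list, `expected_share` and `pick_record` are invented for the example, and what Route 53 does when no record at all is healthy is not modelled here.

```python
import random
from collections import Counter

# One weighted record per gateway, mirroring the three record sets.
records = [
    {"set_id": "gtw1", "value": "220.127.116.11", "weight": 10, "healthy": True},
    {"set_id": "gtw2", "value": "18.104.22.168", "weight": 10, "healthy": True},
    {"set_id": "gtw3", "value": "22.214.171.124", "weight": 10, "healthy": True},
]


def expected_share(records):
    """Fraction of responses each healthy record should receive: weight / sum of weights."""
    total = sum(r["weight"] for r in records if r["healthy"])
    return {r["set_id"]: r["weight"] / total for r in records if r["healthy"]}


def pick_record(records, rng=random):
    """Pick one healthy record at random, proportionally to its weight."""
    candidates = [r for r in records if r["healthy"] and r["weight"] > 0]
    if not candidates:
        return None  # simplification: the all-unhealthy case is out of scope
    weights = [r["weight"] for r in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


all_healthy = expected_share(records)  # each gateway gets 10 / 30

# Simulate gtw2 failing its health check: it simply drops out of the pool.
records[1]["healthy"] = False
degraded = expected_share(records)  # the two survivors get 10 / 20 each

rng = random.Random(42)  # fixed seed so the simulation is repeatable
picks = Counter(pick_record(records, rng)["set_id"] for _ in range(3000))
print(all_healthy, degraded, picks)
```

With all three gateways healthy, each one receives about a third of the answers. As soon as one is marked unhealthy, the remaining two split the traffic evenly without any manual change to the zone.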
Health checks in action
From now on, Health Checks verify every 30 seconds that each cloud gateway can be reached over TCP port 6180. After 3 failed attempts, the record is taken out of rotation; Route 53 does this automatically, with the same effect as if the weight of the record were set to 0. The overall failover time should be no longer than about 150 seconds (2.5 minutes): 90 seconds for the 3 failed health checks to happen, plus the worst case of a client whose DNS cache was refreshed just before the last check, which needs an additional 60 seconds for the TTL of the record to expire. These values can be lowered even further by using fast checks (every 10 seconds).
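The failover arithmetic can be written down as a tiny Python helper. It is a back-of-the-envelope model that only adds detection time and TTL; the function name is made up for the example, and the fast-check figure assumes the failure threshold stays at 3 and the TTL at 60 seconds.

```python
def worst_case_failover(interval_s, failure_threshold, ttl_s):
    """Worst-case seconds before a client stops using a failed gateway.

    Detection takes interval * threshold seconds, and a client that cached
    the record just before detection keeps it for one more full TTL.
    """
    detection = interval_s * failure_threshold
    return detection + ttl_s


standard = worst_case_failover(interval_s=30, failure_threshold=3, ttl_s=60)
fast = worst_case_failover(interval_s=10, failure_threshold=3, ttl_s=60)
print(standard, fast)
```

With the standard 30-second interval this gives the 150 seconds mentioned above; with 10-second fast checks the same formula gives 90 seconds, most of which is then the TTL itself.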
Let’s do some tests. Under normal conditions, if I check the resolution of cc.virtualtothecore.com from my computer I obtain: