Preparing for AWS Aurora Postgres failover

Manish Singh
3 min read · Feb 5, 2021

[Figure: Aurora Postgres cluster with a Writer and one Reader instance]

Overview

In the example shown above we have an Aurora Postgres cluster with a Writer and a Reader instance. In this post we will look at how Aurora failover works and what changes are needed in the application for a smooth failover and predictable downtime. When we first tested a DB failover, our application had a downtime of a couple of hours because some connections kept pointing to the previous writer instance.

Cluster endpoints

As shown above, Aurora gives a Writer and a Reader endpoint to connect to the DB cluster. IP resolution works the same way for both of these endpoints, so we can understand it by resolving the writer endpoint.

As shown above, the DB cluster writer endpoint is actually a CNAME that resolves to the IP address of the current writer instance. After a failover, running dig on the same writer endpoint will return a different IP address. This is the main issue we have to handle in our application layer.
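
For illustration, here is a minimal Java sketch (the endpoint name is hypothetical) that resolves the writer endpoint the same way the application's driver would. Running it before and after a failover should print different IP addresses:

import java.net.InetAddress;
import java.net.UnknownHostException;

public class WriterEndpointCheck {
    public static void main(String[] args) throws UnknownHostException {
        // Hypothetical cluster writer endpoint; replace with your own.
        String writerEndpoint = "my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com";

        // Resolves the CNAME chain down to the IP of the current writer instance.
        InetAddress address = InetAddress.getByName(writerEndpoint);
        System.out.println(writerEndpoint + " -> " + address.getHostAddress());
    }
}

Keep in mind that the JVM's own DNS cache, covered next, can mask the change if it has already cached the first answer.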

DNS cache on App side

The TTL set by AWS on the DNS resolution of the DB cluster writer endpoint is 5 seconds. The application layer, however, might be caching DNS resolutions for much longer. Because of this the application will not pick up the new writer instance's IP address and will keep pointing to the old one.

In Java this can be set as:

java.security.Security.setProperty("networkaddress.cache.ttl", "1");
java.security.Security.setProperty("networkaddress.cache.negative.ttl", "3");

Doing this ensures that, in case of a failover, the application layer becomes aware of the new IP address of the writer instance.
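
These properties are best set as early as possible during application startup, before the first DB connection is made, so that no lookup gets cached under the JVM's default policy. A minimal sketch, with an illustrative entry class:

import java.security.Security;

public class Application {
    static {
        // Shorten the JVM DNS cache before any lookup of the writer endpoint happens.
        Security.setProperty("networkaddress.cache.ttl", "1");
        Security.setProperty("networkaddress.cache.negative.ttl", "3");
    }

    public static void main(String[] args) {
        // ... initialize the connection pool and start the application
    }
}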

Connection pool changes

There might be active connections in the connection pool still pointing to the previous writer instance when the DB failover happens. These older connections will keep pointing to the old instance, leading to errors such as:

UnknownHostException: thrown if the previous writer instance is no longer reachable. This can happen when the writer instance is terminated.

SQLException: thrown if a write operation is performed against the previous writer instance, which has now become read-only. I have seen this happen in the case of a manual failover.

We were using the HikariCP connection pool. We utilised its connection max-lifetime and idle-timeout settings to ensure that the pool picks up new connections to the new writer instance within a predictable time after a failover.
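
A minimal HikariCP sketch with illustrative values (the JDBC URL, credentials and timings are placeholders; the exact numbers depend on how much post-failover lag you can tolerate):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolSetup {
    public static HikariDataSource createDataSource() {
        HikariConfig config = new HikariConfig();
        // Hypothetical JDBC URL pointing at the cluster writer endpoint.
        config.setJdbcUrl("jdbc:postgresql://my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com:5432/mydb");
        config.setUsername("app_user");
        config.setPassword("app_password");

        // Retire connections once they reach ~5 minutes of age, so connections created
        // against the old writer are replaced within a bounded time after failover.
        config.setMaxLifetime(300_000);
        // Close connections that have been sitting idle for more than 1 minute.
        config.setIdleTimeout(60_000);

        return new HikariDataSource(config);
    }
}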

The above values are in milliseconds.

Performance impact

After making any change to DNS caching or connection pool settings, the team should do thorough functional and performance testing to verify the change.

Further work

There can be other ways to phase out the older connections. Once we explore them, we will share what we find:

  1. Programmatically detecting the IP change of the DB instance and closing all connections in the pool.
  2. Using a validation query (checking for the read/write capability as required) before returning a connection to the calling thread. Simply running a read query is not enough, because of the writer-becoming-reader scenario; a sketch of this idea follows below.
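
A sketch of the second idea, assuming the pool (or a thin wrapper around it) can run a check before handing out a connection. pg_is_in_recovery() is one way Postgres reports replica/read-only status, so a former writer demoted to a reader would fail this check:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class WriterValidation {
    // Returns true only if the connection points at an instance that accepts writes.
    static boolean isWritable(Connection connection) throws SQLException {
        try (Statement stmt = connection.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT pg_is_in_recovery()")) {
            rs.next();
            return !rs.getBoolean(1);
        }
    }
}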
